NOTES ON INFERENCE BASED ON DATA
FROM COMPLEX SAMPLE DESIGNS 


Gad Nathan¹


The problems associated with making analytical inferences
from data based on complex sample designs are reviewed.
A basic issue is the definition of the parameter of interest
and whether it is a superpopulation model parameter or
a finite population parameter. General methods based on a
generalized Wald statistic and its modification or on
modifications of classical test statistics are discussed.
More detail is given on specific methods: on linear models
and regression and on categorical data analysis.


1. INTRODUCTION 


Standard methods of inference, such as regression, analysis of
variance or tests of independence, are, in general, based on the
assumption that the data are obtained by simple random sampling from an
infinite population with a probability distribution belonging to some
hypothetical family. The wide dissemination of standard computer
packages has made the use of these methods extremely easy. However,
standard methods cannot usually be simply applied to data from complex
sample designs without any modification.


In the following we attempt to provide a selection of some practical
hints on what can be done and of some warnings against what should not
be done in these situations. This is based on the selected list of
references to recent work in the area, which include many examples of
applications.


The first question which must be answered by anyone who intends to
carry out statistical analysis is what exactly are the parameters
about which inference is required.


¹ G. Nathan, Hebrew University, Jerusalem and Israel Central Bureau
of Statistics.


One of two extreme answers to this question is often given (Brewer and
Mellor (1973); Smith (1976)). One, as advanced for instance by Kish
and Frankel (1974), considers that the only relevant inference concerns
finite population parameters, such as the population regression
coefficient:

$$B = \frac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{N}(X_i-\bar{X})^2},$$

similarly defined multiple or partial correlation coefficients or other
measures, defined with respect to the finite population only, with no
recourse to any superpopulation model. Inference in this case would
usually be design-based (Sarndal (1978)), that is, based only on
properties of the sample distribution. However, model-based inference
about a finite population parameter is also possible (Hartley and
Sielken (1975)).


The other extreme position, as stated, for instance, by Fienberg (1980),
considers all inference as relating to the parameters of a probability
distribution (a superpopulation) of which the finite population
represents a realization. Examples of such inference can be found in
Konijn (1962), Fuller (1975), Thomsen (1978) and Pfeffermann and
Nathan (1981). If the parameters about which inference is made relate
to a superpopulation model, design-based inference cannot be used
alone and inference must be model-based (Sarndal (1978)) or jointly
model- and design-based. Under assumptions of independence between
the model distribution and the sampling distribution, standard
(model-based) inference is valid and the sample design only affects the
efficiency of inference.


Serious objections can be raised with respect to each of these extreme
approaches. Model-based inference relies heavily on assumptions about
a theoretical model which are usually difficult to ensure and the
inference will not, in general, be robust to departures from this model.
On the other hand, the finite population parameters, on which
design-based inference is made, are usually "copies" of theoretical model
parameters with little descriptive value in themselves, unless some
basic model is assumed. For instance, a finite population correlation
coefficient is a useful measure of the relationship between two
variables only if the relationship is approximately linear.


In many cases some balance between these approaches may be preferable.
This can be attained, for instance, by considering as the objects of
inference only finite population parameters which closely approximate
superpopulation parameters of a suitable model, to which the data fit.
For instance, if separate regression equations are fitted to relevant
sub-populations a better linear fit may be obtained than from an
overall regression. If the sub-populations are large enough this will
ensure that the finite population regression coefficients closely
approximate the superpopulation parameters, so that any inference
relating to the finite population parameters can be considered as
relating to the superpopulation parameters.


To ensure close correspondence between model parameters and finite
population parameters, extensive exploratory analysis to check the
model should be carried out before entering into any formal analysis.
This analysis to explore various alternative models can often be based
on simple descriptive measures for which the sample design can be
taken into account or on graphical displays. However, the results have
to be carefully interpreted in the light of the sample design. For
example, a few large residuals with small sample weights may be much
less important than many smaller residuals with large weights. A
useful diagnostic tool to consider in the case of regression is the
difference between a weighted and an unweighted regression coefficient.
A large difference will often indicate that the model is inadequate.
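
The diagnostic just described can be sketched in a few lines. The
example below is only an illustration under stated assumptions: a single
regressor with an intercept, hypothetical data values, and weights taken
as the reciprocals of hypothetical inclusion probabilities. The function
name `slope` is introduced here and is not part of any package mentioned
in the text.

```python
def slope(x, y, w=None):
    """Least-squares slope of y on x (with intercept), optionally weighted."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


# Hypothetical sample: the relationship is mildly curved, and the units
# with large x were sampled with small inclusion probabilities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 4.8, 7.1, 9.0]
pi = [0.5, 0.5, 0.5, 0.1, 0.1, 0.1]
weights = [1.0 / p for p in pi]

b_unweighted = slope(x, y)
b_weighted = slope(x, y, weights)
difference = b_weighted - b_unweighted  # a large value suggests model inadequacy
```

How large a difference counts as "large" is a matter of judgment; the
sketch only computes the two coefficients so that they can be compared.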


Once the parameters have been determined, we should consider what type
of inference is required (point estimation, interval inference or tests
of hypotheses). While point estimation and confidence intervals would
be most appropriate for finite population parameters, tests of
hypotheses, and in particular simple hypotheses, are strictly relevant
only with respect to superpopulation parameters of a well-defined model.
For example, the hypothesis that two domain means are equal can only be
seriously entertained with respect to the superpopulation means rather
than their finite population realizations. If one wishes to avoid the
formulation of a model it would be preferable to use point estimation
or confidence intervals for the difference between the domain means
rather than tests of hypotheses. If hypothesis testing about finite
population parameters is required, testing a composite hypothesis (e.g.
that the difference between the means is in a given range of values)
would be more appropriate than testing the simple hypothesis (that the
difference is zero). Note that for sufficiently large samples, any
non-zero difference, no matter how small, will be found significantly
different from zero.
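
The last remark can be illustrated numerically. The sketch below assumes,
purely for illustration, two independent simple random samples of equal
size n with a known common standard deviation; the function name and the
figures are hypothetical and ignore any complex design.

```python
from math import sqrt


def z_statistic(diff, sigma, n):
    """z statistic for a difference of two means, n observations per domain."""
    return diff / (sigma * sqrt(2.0 / n))


# The same tiny difference (0.01 standard deviations) at two sample sizes.
z_small_sample = z_statistic(0.01, 1.0, 100)
z_large_sample = z_statistic(0.01, 1.0, 1_000_000)
```

With 100 observations per domain the statistic is far below the usual
5% critical value of 1.96, while with a million observations per domain
the same difference exceeds it.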


In the following, we discuss some basic general methods of analysis of
data from complex sample designs and some specific methods for linear
models and for tests of goodness of fit and of independence in
contingency tables. In general we shall consider the inference as
relating to finite population parameters. However, we consider this
inference as relevant only if the finite population parameters closely
approximate superpopulation model parameters. This leaves open the
possibilities of tending either towards a purely design-based approach
or towards a purely model-based approach, according to one's personal
degree of belief in the validity of an underlying model.


2. BASIC GENERAL METHODS 


2.1 Generalized Wald Statistic 


If the hypothesis to be tested is linear (or can be linearized)
in the expected values of asymptotically normal statistics, for which
a consistent estimator of the variance matrix is available, the
generalized Wald statistic can be used (Grizzle, Starmer and Koch (1969),
Koch, Freeman and Freeman (1976), Freeman, Freeman, Brock and Koch
(1976), Shah, Holt and Folsom (1977) and Koch, Stokes and Brock (1980)).


We assume that we wish to test the hypothesis:

$$H_0: X\beta = \theta_0 \qquad (2.1.1)$$

where $X$ is a known $r \times p$ design matrix of full rank, $\beta$ is
a $p \times 1$ unknown parameter vector (either finite population
parameters or superpopulation parameters) and $\theta_0$ is a known
$r \times 1$ vector of constants. In case the hypothesis is not linear a
first-order Taylor series approximation can be used (Nathan (1972) and
Shuster and Downing (1976)).


We assume that a consistent asymptotically normal estimator,
$\hat\beta$, of $\beta$ is available, as well as a consistent estimator,
$\hat V$, of the covariance matrix of $\hat\beta$, whose distribution is
independent of that of $\hat\beta$.


Then the generalized Wald statistic, defined as:

$$X_W^2 = (X\hat\beta - \theta_0)'(X\hat V X')^{-1}(X\hat\beta - \theta_0) \qquad (2.1.2)$$

is asymptotically distributed, under the null hypothesis, as chi-square
with degrees of freedom equal to the dimension of the hypothesis (the
number of rows, r, of the full-rank matrix $X$).


The consistency of $\hat\beta$ and of $\hat V$ and the asymptotic
distributions of $\hat\beta$ and of $X_W^2$ can all be considered with
respect to the sampling distribution or with respect to the
superpopulation distribution.
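
As an illustration of how the statistic in (2.1.2) is assembled, the
following sketch computes the quadratic form for a hypothetical
three-parameter estimate and a two-row contrast matrix. The helper names
(`matmul`, `solve`, `wald_statistic`) and all numerical values are
assumptions introduced here; in practice the covariance estimate would
come from one of the design-based variance estimation methods.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def wald_statistic(X, beta_hat, V_hat, theta0):
    """Quadratic form (X b - theta0)' (X V X')^{-1} (X b - theta0)."""
    fitted = [sum(xij * bj for xij, bj in zip(row, beta_hat)) for row in X]
    d = [f - t for f, t in zip(fitted, theta0)]
    XVXt = matmul(matmul(X, V_hat), transpose(X))
    z = solve(XVXt, d)
    return sum(di * zi for di, zi in zip(d, z))


# Hypothetical estimates: test equality of three parameters (two contrasts).
beta_hat = [1.0, 2.0, 4.0]
V_hat = [[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 0.25]]
X = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
theta0 = [0.0, 0.0]
statistic = wald_statistic(X, beta_hat, V_hat, theta0)
```

The resulting value would then be compared with the chi-square critical
value appropriate to the hypothesis being tested.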


The major problem associated with this approach is in obtaining
the consistent estimator, $\hat V$, of the covariance matrix when
$\hat\beta$ is nonlinear in the sample observations (as will often be
the case). Rao (1975) surveys the various methods of variance estimation
which can be used: linearization (Tepping (1968)); Balanced Repeated
Replication (McCarthy (1969)); and Jackknife (Miller (1974)). Several
general computer programmes are available for their implementation,
e.g. SUPERCARP (Hidiroglou, Fuller and Hickman (1980)) and SUDAAN
(Shah (1978)) for linearization and OSIRIS IV: PSALMS for balanced
repeated replication. A complete listing and comparison of programs is
given by Kaplan, Francis and Sedransk (1979).
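
To make one of these techniques concrete, the sketch below applies a
delete-one-PSU jackknife to a ratio estimate. It is a minimal
illustration under simplifying assumptions that are not taken from the
text: a single stratum, PSUs treated as if sampled with replacement, and
hypothetical weighted PSU totals. It is not the algorithm of any of the
programmes named above.

```python
def ratio(y_totals, x_totals):
    """Ratio estimate from weighted PSU totals."""
    return sum(y_totals) / sum(x_totals)


def jackknife_variance(y_totals, x_totals):
    """Delete-one-PSU jackknife variance of the ratio (single stratum)."""
    m = len(y_totals)
    full = ratio(y_totals, x_totals)
    replicates = [
        ratio(y_totals[:g] + y_totals[g + 1:], x_totals[:g] + x_totals[g + 1:])
        for g in range(m)
    ]
    # Squared deviations of the replicates from the full-sample estimate.
    return (m - 1) / m * sum((rep - full) ** 2 for rep in replicates)


# Hypothetical weighted totals for five PSUs.
y_totals = [12.0, 15.0, 9.0, 20.0, 14.0]
x_totals = [30.0, 33.0, 25.0, 41.0, 31.0]
estimate = ratio(y_totals, x_totals)
variance = jackknife_variance(y_totals, x_totals)
```

For a linear statistic such as a mean the formula reduces to the usual
sample variance divided by the number of PSUs, which gives a quick check
on the implementation.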


Empirical comparisons of the variance estimators are given by Kish
and Frankel (1974) and by Richards and Freeman (1980) and theoretical
comparisons by Krewski and Rao (1981).


However, attention should be given to the stability of the variance
estimator, especially when the number of parameters is large. In
addition, care must be taken with respect to the conditions under
which consistency and asymptotic properties hold for complex designs.
For instance, for a two-stage design asymptotic results may require
both a large number of PSU's and a large number of final units per PSU.


2.2 Approximation and Modelling of the Covariances 


The practical difficulties involved in obtaining a stable consistent
estimator of the covariance matrix have led to attempts to use
simplified approximations to such estimators. The basic idea is that
by assuming some structure for the covariance matrix, more stable
estimators of fewer parameters can be used.


The approximation can be carried out under a pure design-based
approach, directly with respect to the covariance matrix. If
assumptions can be made on equality of design effects for variances and
covariances within a given sub-group of parameters, overall estimators
of covariance can be used. This approach is used, for instance, by
Nathan (1973), Fuller and Rao (1978), Fellegi (1980) and Lepkowski
and Landis (1980).


Alternatively, modelling of the population structure itself can
lead to simplified covariance matrices which can easily be estimated
(see, e.g., Altham (1976), Fuller and Battese (1973), Tomberlin (1979),
Holt, Richardson and Mitchell (1980), Imrey, Sobel and Francis (1980)
and Pfeffermann and Nathan (1981)).




2.3 Modifications of Standard Tests 


The widespread use of standard computer packages has encouraged
the search for simple modifications to standard test procedures to take
into account complex sample design. The idea can be regarded as a
natural extension of the use of design effects as multiplicative factors
for variances based on a simple random sample of the same size, in order
to correct for the complex design used.


The correction may indeed be based on design effects of various
estimators or on average design effects (see, e.g., Cowan and Binder
(1978), Fay (1979), Fellegi (1980), Rao and Scott (1981) and Scott
and Holt (1981)).


Another alternative is to investigate the behaviour of standard
test statistics under some superpopulation model and to modify the
standard statistic accordingly (Cohen (1976) and Campbell (1977)).


3. SPECIFIC METHODS 


3.1 Linear Models and Regression 


The prior determination of the model and of the parameters of
interest is extremely important for the case of regression analysis and
of linear models. For instance, when different regression
relationships must be assumed for different strata or for different
PSU's in a two-stage design, the parameter of interest could be a simple
average of the regression coefficients (Konijn (1962)); a weighted
average of the coefficients (Pfeffermann and Nathan (1981)); or their
expected value (under some prior distribution) (Porter (1973)).


The model and the parameters of interest should, in general, be
determined on the basis of the assumed overall population structure and
should not reflect the structure of the sample design. However, in
many cases the sample design will reflect population structure so that
sample design variables may be part of the model. For example, consider
the model:


$$E(Y \mid X_1, X_2) = X_1\beta_1 + X_2\beta_2 \qquad (3.1.1)$$

where $X_1$ includes only variables which do not relate to the sample
design and $X_2$ includes all the variables which enter into the complex
sample design, i.e. the sample distribution depends only on $X_2$:

$$P(s \mid X_1, X_2) = P(s \mid X_2). \qquad (3.1.2)$$


The estimation of $\beta_1$ and of $\beta_2$ in (3.1.1) and inference
about them can proceed in the classical way, as if sampling were
simple random, if indeed (3.1.1) holds.


However, if the design variables, $X_2$, are not included in the
regression equation of interest:

$$E(Y \mid X_1) = X_1\beta_1 \qquad (3.1.3)$$

and the design variable $X_2$ is correlated with $Y$ (conditional on
$X_1$), then the standard OLS estimator of $\beta_1$ is not consistent
(see Nathan and Holt (1980) and Holt and Smith (1979), who propose
modified weighted and unweighted estimates of $\beta_1$ which are
consistent). Holt, Smith and Winter (1980) give an example of the
application of these estimators.


If the linear model:

$$E(Y_i \mid x_i) = x_i'\beta \qquad (3.1.4)$$

$$\mathrm{cov}(Y_i, Y_j \mid x_i, x_j) = \begin{cases} \sigma^2 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (3.1.5)$$

indeed holds for all population units (i, j = 1, ..., N) of a finite
population and the $p \times 1$ column vector $x_i$ includes all the
sample design variables, then the OLS unweighted estimator:

$$\hat\beta = (X_s'X_s)^{-1}X_s'Y_s \qquad (3.1.6)$$

based on the sampled values $X_s' = (x_1, \ldots, x_n)$ and
$Y_s' = (Y_1, \ldots, Y_n)$, is the "best" linear model-unbiased
estimator of $\beta$ irrespective of the sample design. "Best" here is
in the sense of minimal model variance. However, $\hat\beta$ is, in
general, not a design-unbiased, nor even a design-consistent, estimator
of the population parameter:

$$B = (X_N'X_N)^{-1}X_N'Y_N \qquad (3.1.7)$$

where $X_N' = (x_1, \ldots, x_N)$ is $p \times N$ and
$Y_N' = (Y_1, \ldots, Y_N)$.


The design-consistent estimator of $B$ is the weighted estimator:

$$\hat B_W = (X_s'W_sX_s)^{-1}X_s'W_sY_s \qquad (3.1.8)$$

where the weight matrix, $W_s = \mathrm{diag}(1/\pi_1, \ldots, 1/\pi_n)$,
is the $n \times n$ diagonal matrix of the reciprocals of the sample
inclusion probabilities $\pi_i$ (i = 1, ..., n).


The consistency of $\hat B_W$ as an estimator of $B$ obviously does not
depend on the model (3.1.4) holding, but the relevance of estimating
$B$ when the model does not hold can be challenged. It can be shown
that under certain conditions for a non-linear model, which assumes
that the conditional expectation of $Y$ (given $X$) is a differentiable
function of $X$, the model-expectation of $B$ can be expressed
approximately as a weighted average of the slopes of this function at
the points $x_i$ (the weights depending only on $x_i - \bar X$). However,
this interpretation is of limited practical value.


In any case $\hat B_W$ is a model-unbiased estimator of $\beta$, whenever
(3.1.4) does hold. It will not, in general, be an optimal estimator
of $\beta$ under (3.1.5) for unequal probability sampling, but will be so
if the conditional model variance of $Y_i$ is proportional to $\pi_i$,
i.e.

$$V(Y_i \mid x_i) = k\pi_i. \qquad (3.1.9)$$

Since the weighted estimator, $\hat B_W$, is more robust than the
unweighted estimator, $\hat\beta$, in the sense that it is both a
model-unbiased estimator of $\beta$, if the model holds, and a
design-consistent estimator of $B$, if not, the use of the weighted
estimator $\hat B_W$ is recommended, for estimation of $B$, whenever
there is no assurance that the model (3.1.4)-(3.1.5) holds. The question
which must then be answered by the subject-matter specialist is whether
$B$ is a relevant parameter to estimate.


It should be noted that for self-weighting designs $\hat\beta$ and
$\hat B_W$ coincide. The estimator $\hat B_W$ (3.1.8) can be obtained
directly from standard computer programmes which provide for weighted
regression (e.g. BMDP) by using the weights $1/\pi_i$, or from other
programmes (e.g. SPSS) by carrying out unweighted regression on the
transformed variables $Y_i/\sqrt{\pi_i}$ and $x_i/\sqrt{\pi_i}$, but
not on the weighted variables $Y_i/\pi_i$, $x_i/\pi_i$. However, it
should be noted that under either alternative the reported variances
and covariances of the estimators are incorrect and that the standard
significance tests (e.g. F tests) are invalid, and can result in
grossly misleading conclusions.
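
The equivalence of the two routes to the weighted estimator, and the
failure of the third, can be checked numerically. The sketch below is an
illustration only: it uses an intercept and one regressor, hypothetical
data and inclusion probabilities, and a small hand-written solver
(`least_squares`) rather than any of the packages named in the text.

```python
from math import sqrt


def least_squares(rows, y, w=None):
    """Coefficients minimizing sum w_i (y_i - x_i'b)^2 for two-column rows."""
    if w is None:
        w = [1.0] * len(y)
    a11 = sum(wi * r[0] * r[0] for wi, r in zip(w, rows))
    a12 = sum(wi * r[0] * r[1] for wi, r in zip(w, rows))
    a22 = sum(wi * r[1] * r[1] for wi, r in zip(w, rows))
    b1 = sum(wi * r[0] * yi for wi, r, yi in zip(w, rows, y))
    b2 = sum(wi * r[1] * yi for wi, r, yi in zip(w, rows, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical sample with unequal inclusion probabilities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 4.8, 7.1, 9.0]
pi = [0.5, 0.5, 0.5, 0.1, 0.1, 0.1]
rows = [(1.0, xi) for xi in x]  # first column is the intercept

# Route 1: weighted regression with weights 1/pi.
weighted = least_squares(rows, y, [1.0 / p for p in pi])

# Route 2: unweighted regression on variables divided by sqrt(pi).
rows_t = [(r0 / sqrt(p), r1 / sqrt(p)) for (r0, r1), p in zip(rows, pi)]
y_t = [yi / sqrt(p) for yi, p in zip(y, pi)]
transformed = least_squares(rows_t, y_t)

# Mistaken route: dividing by pi itself amounts to weights 1/pi**2.
rows_bad = [(r0 / p, r1 / p) for (r0, r1), p in zip(rows, pi)]
y_bad = [yi / p for yi, p in zip(y, pi)]
mistaken = least_squares(rows_bad, y_bad)
```

Note that the intercept column must be transformed along with the other
regressors, so the transformed regression has no constant column.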


Assuming the model (3.1.4) - (3.1.5), the model variance of $\hat\beta$
is:

$$V(\hat\beta \mid X_s) = \sigma^2(X_s'X_s)^{-1} \qquad (3.1.10)$$

which is the result given by standard unweighted regression programmes.
However, the model variance of $\hat B_W$ is:

$$V(\hat B_W \mid X_s) = \sigma^2(X_s'W_sX_s)^{-1}X_s'W_s^2X_s(X_s'W_sX_s)^{-1}. \qquad (3.1.11)$$

The weighted regression programme, with weights $1/\pi_i$, will give
a value of $\sigma^2(X_s'W_sX_s)^{-1}$ for the model variance of
$\hat B_W$, which equals (3.1.11) only if $W_s = I$. Thus none of the
standard outputs for standard errors or for tests of hypotheses are
correct.
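
The gap between the two variance expressions is easy to see in the
simplest special case. The sketch below assumes, for illustration only,
a single regressor with no intercept, so that every matrix in (3.1.11)
reduces to a scalar sum; the function names and numbers are hypothetical.

```python
def naive_variance(x, w, sigma2):
    """sigma^2 (X'WX)^{-1}: the form a weighted regression programme uses."""
    return sigma2 / sum(wi * xi * xi for wi, xi in zip(w, x))


def model_variance(x, w, sigma2):
    """sigma^2 (X'WX)^{-1} X'W^2X (X'WX)^{-1} for a single regressor."""
    a = sum(wi * xi * xi for wi, xi in zip(w, x))
    b = sum(wi * wi * xi * xi for wi, xi in zip(w, x))
    return sigma2 * b / (a * a)


x = [1.0, 2.0]
unequal = [1.0, 3.0]  # hypothetical weights 1/pi_i
equal = [1.0, 1.0]    # all weights equal to one

v_naive = naive_variance(x, unequal, 1.0)
v_model = model_variance(x, unequal, 1.0)
```

With all weights equal to one the two functions agree; with the unequal
weights above they give 13/169 and 37/169 respectively.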
However, the estimator of the multiple correlation coefficient obtained
from weighted regression:

$$\hat R_W^2 = 1 - \frac{(Y_s - X_s\hat B_W)'W_s(Y_s - X_s\hat B_W)}{(Y_s - \bar Y_W 1_n)'W_s(Y_s - \bar Y_W 1_n)} \qquad (3.1.12)$$

where $\bar Y_W = (\sum_i Y_i/\pi_i)/(\sum_i 1/\pi_i)$, is a
design-consistent estimator of the population multiple correlation
coefficient:

$$R^2 = 1 - \frac{(Y_N - X_NB)'(Y_N - X_NB)}{(Y_N - \bar Y_N 1_N)'(Y_N - \bar Y_N 1_N)} \qquad (3.1.13)$$

where $\bar Y_N = (1/N)\,1_N'Y_N$.


The design-variance of $\hat B_W$, which must be considered the relevant
measure of accuracy for $\hat B_W$ as an estimator of $B$, cannot, in
general, be obtained from only the first order inclusion probabilities,
$\pi_i$. For most sample designs used in practice, the design-variance
of $\hat B_W$ will have to be estimated by one of the variance
estimating techniques mentioned above, i.e. linearization, Balanced
Repeated Replication or Jackknife (see, e.g., Jonrup and Remmermalm
(1976) and Holt and Scott (1981)).


3.2 Categorical Data Analysis 


The simplest analysis of categorical data relates to a single
classification of the population into k classes with probabilities
(relative frequencies) $p = (p_1, \ldots, p_{k-1})'$. In order to test
the null hypothesis of goodness of fit to a known distribution
$p_0 = (p_{01}, \ldots, p_{0,k-1})'$:

$$H_0: p = p_0 \qquad (3.2.1)$$

the approaches outlined in section two can be used.


We assume that a consistent survey estimator
$\hat p = (\hat p_1, \ldots, \hat p_{k-1})'$ of $p$ is available. If it
is asymptotically normal:

$$\sqrt{n}\,(\hat p - p) \to N(0, V) \qquad (3.2.2)$$

and a consistent estimator, $\hat V$, of $V$ is available, then the
generalized Wald statistic:

$$X_W^2 = n(\hat p - p_0)'\hat V^{-1}(\hat p - p_0), \qquad (3.2.3)$$

which is distributed asymptotically as $\chi^2_{k-1}$ under $H_0$, can
be used to test $H_0$.


For many simple designs consistent estimators of $V$ are directly
available and for more complex designs they can be obtained by standard
methods. However, if tests of hypotheses of goodness of fit have to be
carried out for a variety of variables and classifications, the use of
the standard $X^2$ statistic:

$$X^2 = n\sum_{i=1}^{k}(\hat p_i - p_{0i})^2/p_{0i} = n(\hat p - p_0)'P_0^{-1}(\hat p - p_0), \qquad (3.2.4)$$

where $P_0 = \mathrm{diag}(p_0) - p_0p_0'$, with appropriate
modification may be preferred. Rao and Scott (1981) show that the
asymptotic distribution of $X^2$ under $H_0$ is that of a weighted sum
of k-1 independent $\chi^2$ variables with one degree of freedom each:

$$X^2 \to \sum_{i=1}^{k-1}\lambda_iZ_i^2; \quad Z_i \sim N(0,1) \text{ independent} \qquad (3.2.5)$$

where $\lambda_1, \ldots, \lambda_{k-1}$ are the eigenvalues of

$$D = P_0^{-1}V \quad (\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{k-1} > 0). \qquad (3.2.6)$$

A conservative test of (3.2.1) can then be obtained by using the
statistic $X^2/\lambda_1$ in conjunction with a $\chi^2_{k-1}$
distribution. $\lambda_1$ can be [...] components of $p$. For example,
for proportional stratified sampling $\lambda_1 \leq 1$, so that $X^2$
itself can be used as a conservative test statistic.


In other cases the use of $X^2/\hat\delta$ with:

$$\hat\delta = \sum_{i=1}^{k-1}\hat\lambda_i/(k-1) = \sum_{i=1}^{k}\hat d_i(1-\hat p_i)/(k-1),$$

where $d_i = V[\hat p_i]/[p_i(1-p_i)]$ is the design effect for
$\hat p_i$, has been shown to be a good approximative test by Hidiroglou
and Rao (1981) for the Canada Health Surveys and by Holt, Scott and
Ewings (1980) for large scale U.K. surveys. An alternative
approximation, $X^2/\bar d$, where $\bar d = k^{-1}\sum_{i=1}^{k}\hat d_i$,
has been proposed by Fellegi (1980).
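
The two corrections can be sketched as follows. The example takes the
correction factor to be the sum of d_i(1 - p_i) over the k categories
divided by k-1, and the alternative factor to be the simple average of
the category design effects; the proportions, design effects and
function names are hypothetical and serve only as an illustration.

```python
def pearson_x2(n, p_hat, p0):
    """Standard goodness-of-fit statistic n * sum (p_hat - p0)^2 / p0."""
    return n * sum((ph - p) ** 2 / p for ph, p in zip(p_hat, p0))


def first_order_factor(p_hat, deffs):
    """sum_i d_i (1 - p_hat_i) / (k - 1), summed over all k categories."""
    k = len(p_hat)
    return sum(d * (1.0 - ph) for d, ph in zip(deffs, p_hat)) / (k - 1)


def average_deff(deffs):
    """Simple average of the category design effects."""
    return sum(deffs) / len(deffs)


# Hypothetical survey: three categories, n = 1000.
n = 1000
p_hat = [0.30, 0.50, 0.20]
p0 = [0.25, 0.50, 0.25]
deffs = [2.0, 1.5, 2.5]  # estimated design effects of the p_hat_i

x2 = pearson_x2(n, p_hat, p0)
x2_first_order = x2 / first_order_factor(p_hat, deffs)
x2_average = x2 / average_deff(deffs)
```

Both corrected values would be referred to a chi-square distribution
with k-1 degrees of freedom. When every design effect equals one, the
first factor is exactly one and the standard statistic is unchanged.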


Direct modelling for $p$ has been proposed by Altham (1976) and by
Cohen (1976), but their models have the serious limitation that they
imply $\lambda_1 = \lambda_2 = \ldots = \lambda_{k-1} = \lambda$, which
is equivalent to a constant design effect over categories. This is not
a realistic assumption, in general, and results in $X^2/\lambda$ having
exactly an asymptotic $\chi^2_{k-1}$ distribution.


For testing independence in a two-way contingency table, the
hypotheses can be formulated:

$$H_0: h_{ij}(p) = p_{ij} - p_{i+}p_{+j} = 0 \quad (i = 1, \ldots, r-1;\ j = 1, \ldots, c-1) \qquad (3.2.7)$$

where $p_{ij}$ is the population probability of cell (i,j), $p_{i+}$,
$p_{+j}$ are the marginal probabilities and $p$ is the vector of the
cell probabilities $p_{ij}$. The generalized Wald statistic for testing
$H_0$ is:

$$X_W^2 = n[h(\hat p)]'\hat V_h^{-1}[h(\hat p)], \qquad (3.2.8)$$

where $[h(\hat p)]' = [h_{11}(\hat p), \ldots, h_{r-1,c-1}(\hat p)]$ and
$\hat V_h/n$ is a consistent estimator of the covariance matrix of
$h(\hat p)$. Versions of (3.2.8) for specific designs with various
methods for estimating $V_h$ have been used by Garza-Hernandez and
McCarthy (1962), Nathan (1969, 1975), Shuster and Downing (1976) and
Fellegi (1980).




A modified statistic similar to $X^2/\hat\delta$ has been proposed by
Rao and Scott (1981):

$$X_M^2 = (n/\hat\delta)\sum_{i=1}^{r}\sum_{j=1}^{c}(\hat p_{ij} - \hat p_{i+}\hat p_{+j})^2/(\hat p_{i+}\hat p_{+j}), \qquad (3.2.9)$$

where $\hat\delta = \sum_i\sum_j \hat v_{ij}(h)/[(r-1)(c-1)\hat p_{i+}\hat p_{+j}]$
and $\hat v_{ij}(h)/n$ is an estimator of the variance of
$h_{ij}(\hat p)$. $\hat\delta$ can be written in terms of the estimated
deffs of $h_{ij}(\hat p)$:

$$\hat\delta = \frac{1}{(r-1)(c-1)}\sum_{i=1}^{r}\sum_{j=1}^{c}(1-\hat p_{i+})(1-\hat p_{+j})\hat\delta_{ij}, \qquad (3.2.10)$$

where $\hat\delta_{ij}$ is an estimator of the deff, $\delta_{ij}$, of
$h_{ij}(\hat p)$:

$$\delta_{ij} = nV[h_{ij}(\hat p)]/[p_{i+}p_{+j}(1-p_{i+})(1-p_{+j})]. \qquad (3.2.11)$$

Estimates of the design effects may be easier to obtain than estimates
of variances.

Empirical investigations by Holt, Scott and Ewings (1980) and by
Hidiroglou and Rao (1981) indicate that the distribution of $X_M^2$ is
close to $\chi^2_{(r-1)(c-1)}$.
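
A sketch of the modified independence test follows. It assumes that the
correction factor is the sum over all cells of
(1 - p_i+)(1 - p_+j) times the cell design effect, divided by
(r-1)(c-1); the table of estimated proportions, the design effects and
the function names are hypothetical.

```python
def margins(table):
    """Row and column marginal proportions of a table of cell proportions."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols


def independence_x2(n, table):
    """Standard statistic n * sum (p_ij - p_i+ p_+j)^2 / (p_i+ p_+j)."""
    rows, cols = margins(table)
    return n * sum(
        (table[i][j] - rows[i] * cols[j]) ** 2 / (rows[i] * cols[j])
        for i in range(len(rows))
        for j in range(len(cols))
    )


def correction_factor(table, deffs):
    """Weighted average of the cell design effects of the h_ij."""
    rows, cols = margins(table)
    r, c = len(rows), len(cols)
    total = sum(
        (1.0 - rows[i]) * (1.0 - cols[j]) * deffs[i][j]
        for i in range(r)
        for j in range(c)
    )
    return total / ((r - 1) * (c - 1))


# Hypothetical 2 x 2 table of estimated cell proportions, n = 400.
n = 400
table = [[0.3, 0.2], [0.2, 0.3]]
deffs = [[2.0, 2.0], [2.0, 2.0]]  # estimated deffs of the h_ij

x2 = independence_x2(n, table)
x2_modified = x2 / correction_factor(table, deffs)
```

In this example a design effect of two for every cell halves the
standard statistic, from 16 to 8, before it is referred to a chi-square
distribution with (r-1)(c-1) degrees of freedom.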


3.3 Other Types of Analysis 


While linear models, tests of goodness of fit and tests of
independence cover many important analysis applications, other types of
analysis, such as principal component and factor analysis, discriminant
analysis, path analysis, logistic regression, log-linear models,
nonparametric methods, etc. cannot be directly dealt with in the same
way. While the general techniques outlined in section two could be
used, their application presents difficulties and only a few cases of
their application have been reported.


Since correlation coefficients are a basic element in most multivariate
analysis, some empirical studies of the effect of sample design on their
estimation have been carried out by Kish and Frankel (1974), Bebbington
and Smith (1977) and Holt, Richardson and Mitchell (1980). No general
conclusions can be formulated, but design effects are definitely not
negligible. Bebbington and Smith (1977) have also studied the sampling
variability of principal components estimators.


In other areas design effects for logits have been studied by Lepkowski 
and Landis (1980) and confidence intervals for quantiles by Woodruff 
(1952) and by Sedransk and Meyer (1978).

THE NONRESPONSE PROBLEM


J.G. BETHLEHEM AND H.M.P. KERSTEN¹


This paper presents an outline of the nonresponse research
which is carried out at the Netherlands Central Bureau of
Statistics. The phenomenon of nonresponse is put into a
general framework. The extent of nonresponse is indicated
with figures from a number of CBS-surveys. The use of
auxiliary variables is discussed as a means for obtaining
information about nonrespondents. These variables can be
used either to characterize nonrespondents or as
stratification variables in adjustment procedures.


Adjustment for nonresponse bias by means of subgroup 
weighting is considered in more detail. Finally, the last 
section lists a number of other methods which also aim at 
reduction of the bias. 


1. INTRODUCTION 


Nonresponse is becoming a growing concern in survey research. The
phenomenon of nonresponse, when people are not able or willing to answer
questions asked by the interviewer, can appear in sample surveys as well
as in censuses. It affects the quality of the survey in two ways: first
of all, due to reduction of the available amount of data, estimates of
population parameters will be less precise. Secondly, if a relationship
exists between the variable under investigation and response behaviour,
statements made on the basis of the response are not valid for the total
population. For example, if the housing demand of respondents is greater
than the housing demand of nonrespondents, estimates of the housing demand
in the total population will be significantly too high.
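
The housing example can be put in numbers. The sketch below treats the
population as split into fixed groups of respondents and nonrespondents,
which is an assumption made only for this illustration, and uses
hypothetical figures for the response rate and the two group means.

```python
def population_mean(response_rate, mean_resp, mean_nonresp):
    """Overall mean as a mixture of respondent and nonrespondent means."""
    return response_rate * mean_resp + (1.0 - response_rate) * mean_nonresp


def respondent_bias(response_rate, mean_resp, mean_nonresp):
    """Bias of the respondent mean as an estimate of the population mean."""
    return mean_resp - population_mean(response_rate, mean_resp, mean_nonresp)


# Hypothetical: 70% respond; 40% of respondents and 20% of
# nonrespondents have a housing demand.
bias = respondent_bias(0.7, 0.4, 0.2)
```

The bias equals the nonresponse fraction times the difference between
the two group means, here 0.3 x 0.2 = 0.06, so it vanishes only when the
groups do not differ or when everyone responds.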


J.G. Bethlehem and H.M.P. Kersten, Netherlands Central Bureau of Statistics. 
The views expressed in this paper are those of the authors and do not 
necessarily reflect the policies of the Netherlands Central Bureau of 
Statistics.

It is obvious that the extent of the nonresponse must be kept as small
as possible. If, in spite of these efforts, there still remains a
considerable amount of nonresponse, measures have to be taken in order to
prevent formulation of wrong statements about the population. Combination
of adjustment procedures and usual estimation techniques is necessary to
yield valid population estimates.


Two departments of the CBS (Netherlands Central Bureau of Statistics)
are involved in nonresponse research. The Department for Social Surveys
is responsible for the field work of the surveys. It is concerned with
minimizing nonresponse during the process of collecting data. Research is
carried out on the optimal number of recalls and the time of the interview.
(See Widdershoven & Van den Berg (1980).) Experiments are set up to find
the optimal way to approach persons and households with introductory
letters. Attempts are made to measure the impact of interview fatigue and
interview pressure. Ultimately, notwithstanding these efforts, there still
remains an amount of nonresponse. The Department for Statistical Methods
investigates the effect of nonresponse on the accuracy of the results of
the survey. Methods are developed there to adjust population estimates for
the bias due to nonresponse. The remainder of this paper is mainly
concerned with the work of the latter department.


The next sections present an outline of the nonresponse analysis at the 
CBS. Section 2 introduces definitions and the accompanying problems. 
Nonresponse figures of a number of CBS-surveys are summarized. In section 
3 graphical methods are discussed to select auxiliary variables. They 
provide insight into nonresponse and can be used in adjustment procedures. 
Section 4 presents adjustment methods which make use of subgroup weighting
and section 5 lists a number of other methods.


2. THE PHENOMENON OF NONRESPONSE 


In this section the problem of nonresponse is placed in a general
framework, in which also a number of other sampling problems play a role.
Nonresponse figures for a number of CBS surveys are given. Situations are
described in which a relationship exists between the variable under
investigation and the response behaviour. In the last part of the section
two models for the generation of nonresponse are considered.

2.1 Terminology 

The objective of every survey is the determination of certain population 
characteristics. Due to all kinds of errors, the true value will generally 
never be obtained. A typology of sources of error is presented in fig. 1. 


The scheme is due to Kish (1967). 


FIG. 1. TYPOLOGY OF ERRORS IN SURVEYS

error
    sampling error
    nonsampling error
        observation error
            measurement error
            processing error
        nonobservation error
            noncoverage
            nonresponse


The two sources of error in surveys are sampling errors and nonsampling 
errors. Sampling errors consist of that part of the error which is due 
to the fact that only a sample of values is observed rather than the 
total population. The sampling error has an expected frequency
distribution generated by the totality of sampling errors in all possible
samples of the same size. This distribution is used to estimate the
population characteristic.




Nonsampling errors are those errors in sample estimates which cannot be
attributed to sampling fluctuations. Nonsampling errors are often a more
serious problem than sampling errors. Nonsampling errors can be divided
into observation errors and nonobservation errors.


Observation errors are caused by obtaining and recording observations 
incorrectly. They may be further subdivided into measurement errors and 


processing errors. 


Measurement errors are caused either by the interviewer or by the respon- 
dent. The interviewer himself can be a source of error. He can influence 
the response by his mere presence, by his (or her) sex, skin colour, age, 

or dress. Also the way in which he asks questions and clarifies statements 
affects results. The answer of a person may depend on the type of question 
(whether a question measures a fact such as year of birth, or an opinion). 
Errors can also be introduced by factors such as whether the person under- 
stands the question, whether he knows the answer or not, whether he wishes 
to conceal the answer, or whether he wishes to present a certain image. 
Moreover, memory is not always free of errors, and data may be incorrectly 


recorded. 


Processing errors arise during the processing of the data at the office. 


They occur during the stage of coding, tabulating and computing. 


Nonobservation errors are due to the failure to obtain observations on
certain parts of the population. They may be subdivided into noncoverage
and nonresponse.


Let the target population be the population the survey is intended to cover. 
Practical difficulties in handling parts of the population may result in 
their elimination from the scope of the survey. It is also possible that 
the actually sampled population contains elements which do not belong to the 


scope of the survey.

Noncoverage refers to all errors which result from differences between
target population and sampled population. Elements which belong to the 
target population as well as to the sampled population are correct elements. 
The situation in which elements in the target population do not appear in 
the sampled population is called undercoverage. These elements have zero 
probability of selection in the sample. The situation in which elements 

in the sampled population do not appear in the target population is called 
overcoverage. Elements, classified as overcoverage, are called duds. They 
have to be excluded from the sample before analysis takes place. If there 
is unexpected overcoverage the ultimate sample size may be less than the 


planned sample size. 


Nonresponse refers to failure to obtain observations on some elements selec- 
ted and designated for the sample. A good classification of nonresponse 
errors depends on the survey situation. The classification given below 
focuses on problems in face-to-face interviews. A similar treatment may be 
applicable in other survey situations. The following categories of
nonresponse can be distinguished:


(1) Not at home. To reduce the extent of this category recalls can
be made. Research should be carried out on the optimal number of
recalls. The term temporarily unavailable would be a useful
generalization for this category, denoting a delay rather than a denial
of the interview. The respondent may be too busy, tired, or ill
at the time, but will be cooperative on another call.


(2) Refusal. Some of the factors causing refusal are temporary and
changeable. A person may refuse because he is ill-disposed or
approached at the wrong hour. Another try, or another approach, may
find him cooperative. Since quite a number of refusals can, however,
be considered permanent, a better term for this category is
unobtainable, denoting a denial rather than a delay of observation.
Repeated attempts will not bring success. From this view, respondents
known to be away during the entire survey period belong in
this category, rather than among the not-at-homes.




(3) Incapacity or inability. This type of nonresponse may refer to 
mental or physical illness which prevents response during the entire 
survey period. A language barrier belongs also to this category. 

If generalized this category could fit in the previously defined 
unobtainables. It can, however, be useful in some situations to 
distinguish between the unwilling and the willing, but incapable, 


respondent. 


(4) Not found. This category can, for example, be large for movers. Such
respondents are either not identified or not followed up because this
would be too expensive. Cases of interviews not attempted belong to the
same general category. They could be caused by inaccessibility
(lighthouse keeper, shepherd), or dangerous surroundings (watchdog,
slum).


(5) Lost information. Information may get lost after a field attempt.
Some questionnaires may be unusable because of poor quality or
cheating. Others may remain unfilled because they were lost or
forgotten.


The typology as described above is applicable in most survey situations,
but care must be taken in case of complex sampling designs. When, for
example, sampling takes place in several stages, the typology can be used in
each separate stage. The same source of error can be classified differently
in different stages. This is illustrated in an example. In a household survey
first a sample of households is selected. The interviewer enumerates all
persons in a particular selected household and after that selects a sample
from this list. In such an enumeration the student living in an attic is
often concealed. In the first stage of the sampling procedure this
situation would be classified as measurement error, and in the second stage
as undercoverage.


For some sources of error classification may depend on other factors and
appropriate rules to cover them must be adopted. For example, if a person
to be interviewed died before the interview could take place, classification
depends on the time of death. If death occurred before the day the sample
was selected this could be classified as overcoverage, but if death occurred
between the day the sample was selected and the day of the interview, the
correct classification may be nonresponse.


Before selecting the sample, the population must be divided into sampling 
units. To every element in the population there must correspond one and 
only one sampling unit. The construction of the physical list of sampling 
units, called the sampling frame, is often a major practical problem. The 
nature of the available sampling frames is an important consideration in 
sample design. Relevant factors include the type of sampling unit, extent 
of coverage, accuracy and completeness of the list, and the amount and 


quality of auxiliary information in the list. 


For sampling frames in which the sampling unit is a person the CBS has to
restrict itself to administrative records of local authorities
(municipalities). For household surveys the CBS manages its own frame, but
at the moment the use of the list of delivery points of the Post Office is
being considered as a sampling frame.

2.2 The Extent of Nonresponse 


It is rather difficult to compare nonresponse figures of different surveys.
The percentage of nonresponse depends on a number of circumstances: aim of
the survey, type of sampling unit, the sampling design, efficiency of the
field work, performance of the interviewers, nonresponse reducing measures,
period in which the survey is held, the target population, the length of
the questionnaire, wording of questions, etc. Even the definition of
nonresponse may differ. It is necessary to create a framework which enables
proper comparison of surveys. By controlling the factors which influence
nonresponse figures, judgement can be passed on the quality of the different
surveys. Such a framework also offers opportunities for comparing surveys
from different countries.

Table 1 presents nonresponse figures of a number of CBS-surveys. A clear
trend of increasing nonresponse percentages can be seen in this table.


Table 1: Nonresponse percentages of some CBS-surveys 


eae? Pen arr tn rn tn = rn tn me rn tn : rn 
O97 St ies 3.2 
1974 Zo22e 520 
P9750 soe 9.0 30 al 1o.3 14.5 
1976 Bal a 1Sabiw 23.05) 15:6 12.9 
Pe77) Ure. hee Georeneolg! m2085) (929 4/iorl los 9 17.6 9:3 
1978 36m 123-9 B30 126.2 ee Zio o eS 
1979 19.7 Wi lon. See 30 Gamez sq be 
1980 BONS? 2heefoGe 85.63 y 19). 7abu 2s Mia S 

Notes: 1) = elderly people only; 2) = young people only;
       tn = percentage of total nonresponse; rn = percentage of refusals.

LFS = Labour Force Survey
SSC = Survey of Consumer Sentiments
SLC = Survey of Living Conditions
NTS = National Travel Survey
HS  = Holiday Survey


As mentioned before a relationship between the variable under investigation
and the response behaviour reduces the value of the conclusions of the
survey. The existence of such relationships is not rare, as will be
illustrated in the following examples. If the aim of the survey is to measure
in which way people spend their spare time, then the nonresponse category
"not at home" is rather troublesome since these people are probably spending
their (spare) time somewhere else. The same applies to a survey on the
number of hours people watch television: the not-at-homes (in the evening)
are probably not watching television. One of the aims of the Housing
Demand Survey is to measure the frequency with which people move to other
houses. As there is a considerable amount of nonresponse due to moving
(the sampling unit is a person), the estimate for the total population
will be biased. A number of surveys show that unmarried people have a
lower response rate. If there is a relationship between marital status
and the variable under investigation then estimates will be biased in this
case too.

2.3 Response Models 


The first requirement in the development of theories for the treatment of 
nonresponse is the formulation of a mathematical model, which describes the 
way in which nonresponse is generated. Two models appear frequently in the
literature. They are denoted here by "random response model" and "fixed
response model".


According to the random response model every element in the population has 
a certain (unknown) probability of response. These response probabilities 
are not necessarily the same for every element. When the interviewer
contacts the person to be questioned the probability mechanism is activated
and determines whether or not the person responds.


The fixed response model assumes the existence of two strata in the
population: a stratum of potential respondents and a stratum of potential
nonrespondents. Size and content of each stratum are not known beforehand.
They are determined by the specification of the survey (aim, type of
questions, interviewing techniques, interviewers, period of field work, etc.).
Disregarding the two strata a sample is selected from the population.
Consequently the number of respondents is a random variable in both the
random response model and the fixed response model.


If, instead of sampling, complete enumeration took place, then in the
case of the random response model the determination of respondents would
still be a random process whereas in the case of the fixed response model
this would be fixed. There is, however, a certain resemblance between the
two models. Assuming the existence of two stochastic mechanisms, the
sampling mechanism and the response mechanism, both models differ only in
the order in which the mechanisms are applied. In the fixed response model
first the response mechanism is activated for each element in the population.
This determines the two strata. Then the sample is selected. In the random
response model first the sample is selected. Then the response mechanism is
activated for each selected element.


The random response model offers the opportunity to estimate response
probabilities. These estimated response probabilities can be used in
adjustment procedures, or they can be connected to personal characteristics.
The fixed response model generally results in easier formulae. The theory,
developed within this model, is conditional on the realized response and
nonresponse strata. Consequently the accuracy of the estimates can be
computed, but the accuracy of the estimation method cannot be determined.
For this last reason research is focussed on the random response model.
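
The difference in the order of the two mechanisms can be made concrete with
a small simulation. The sketch below is only an illustration: the function
names, the population size and the response probabilities are hypothetical
and do not come from any CBS survey.

```python
import random

def fixed_response_strata(probs, rng):
    """Fixed response model, step 1: the response mechanism acts on the
    whole population and fixes the respondent/nonrespondent strata."""
    return [rng.random() < p for p in probs]

def fixed_response_sample(is_respondent, n, rng):
    """Fixed response model, step 2: draw the sample; the respondents are
    the sampled elements lying in the respondent stratum."""
    sample = rng.sample(range(len(is_respondent)), n)
    return [i for i in sample if is_respondent[i]]

def random_response_sample(probs, n, rng):
    """Random response model: draw the sample first, then activate the
    response mechanism for each selected element."""
    sample = rng.sample(range(len(probs)), n)
    return [i for i in sample if rng.random() < probs[i]]

rng = random.Random(1)
# Hypothetical population of 1000 elements with two response probabilities.
probs = [0.9 if i % 2 == 0 else 0.5 for i in range(1000)]
strata = fixed_response_strata(probs, rng)

m_fixed = len(fixed_response_sample(strata, 100, rng))
m_random = len(random_response_sample(probs, 100, rng))
```

In both models the number of respondents in a sample of 100 is random. Under
complete enumeration (n = 1000) the fixed model always returns the same set
of respondents, whereas the random model does not.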


3. SELECTION OF AUXILIARY VARIABLES 


3.1 Auxiliary Variables 


It is important to discover a possibly existing relationship between the 
variable under investigation and the response behaviour. It is, however, 
not possible to determine such a relationship using the sample data, since 
the values of the variable under investigation are not known for the nonres- 
pondents. To be able to say something about nonrespondents there must be 
information available about them. One source of information about the
nonresponse is formed by auxiliary variables. Auxiliary variables are defined
as variables which can be measured for both respondents and nonrespondents.


Two types of auxiliary information can be distinguished: 


(1) Information which can be collected by the interviewer without 
a face-to-face interview. Among the information, obtained in 
this way, are type of town, type of housing, (approximate) year 
of construction of the housing and social status of the 
neighbourhood. 


(2) Information which can be obtained from administrative records.
Typical examples are age, sex and marital status.

Analysis of the relationship between auxiliary variables and the response
behaviour provides insight into the group of people who do not respond.
It may give additional information about the relationship between the
variable under investigation and the response behaviour. Auxiliary
variables showing a clear relationship with the response behaviour play an
important role in adjustment procedures, to be discussed later.


It is assumed that auxiliary variables are nominal variables, i.e. different 
values have no other meaning than to distinguish between different groups. 
Arithmetic operations on these values, which in fact are only labels, are 
not allowed. The assumption that the variables are nominal is in practice 
not a restriction. Many variables are nominal and other types of variables 
can easily be re-expressed in terms of nominal variables. As an example of
the available amount of auxiliary information, the auxiliary variables of
the Housing Demand Survey 1977/1978 are listed below.


(1) year of birth               (7)  number of floors in the housing
(2) sex                         (8)  year of construction of the housing
(3) marital status              (9)  municipality
(4) size of the family          (10) quarter of town
(5) structure of the family     (11) degree of urbanization
(6) type of housing

3.2 Graphical Methods 


As a preliminary tool in the selection of auxiliary variables graphical
methods have been developed. The advantage of graphical methods is clear.
They bring out hidden facts and relationships and can stimulate as well as
aid the analysis. They often offer a more complete and better balanced
understanding than could be obtained from tabular or textual forms of
presentation. Furthermore the visual relationships in the plots are more
clearly grasped and more easily remembered. (See Schmid (1954).) Two
simple graphical devices are presented in the next sections: the box-plot
and the windmill-plot.

3.2.1 The Box-Plot


The box-plot can be seen as a generalization of a histogram or bar chart. 


The name of the box plot is derived from its form (see fig. 2). 


FIGURE 2. THE BOX-PLOT

[Figure: box divided into a response part (left) and a nonresponse part
(right).]


A rectangle of standard width and a height proportional to the sample size
represents the sample. The rectangle is divided in a number of layers (the
categories of the auxiliary variable). The height of a particular layer
is proportional to the number of sample elements in the corresponding
category. Each layer is divided by a vertical line in a left-hand part (the
response) and a right-hand part (the nonresponse). The areas of these two
parts are proportional to the amounts of response and nonresponse in the
particular category. Fig. 3 contains an example of a box-plot. The data
originate from the Housing Demand Survey 1977/1978 as far as it concerns
Amsterdam. The auxiliary variable is the marital status of the person in
the sample.

FIGURE 3. BOX-PLOT OF MARITAL STATUS IN AMSTERDAM IN
THE HOUSING DEMAND SURVEY 1977/1978.

[Figure: box divided into response (left) and nonresponse (right), with
layers labelled not yet married, married, divorced, widowhood.]


A number of aspects may be worth paying attention to:



(1) The heights of the layers indicate to what extent categories
contribute to the sample. Clearly a large part of the people is
married. The smallest category is the category of people who are
divorced.

(2) The extent of the nonresponse can be read from the distance of
the vertical dividing lines to the right-hand side of the box.
In this example there obviously is a considerable amount of
nonresponse.

(3) If all dividing lines form approximately a straight line there is
no relationship between response behaviour and the auxiliary
variable. Clearly, in this situation there exists a relationship:
married people respond better than other people. Response is bad
in the group of unmarried and divorced people.

More about the box-plot can be found in Bethlehem & Kersten (1981).
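
The geometry of the box-plot follows directly from the response and
nonresponse counts per category. The sketch below computes, for a box of
width 1 and height 1, the height of each layer and the position of its
dividing line. The function name and the counts are hypothetical and are not
the Amsterdam figures.

```python
def box_plot_layers(counts):
    """counts maps a category to (respondents, nonrespondents).

    Returns, per category, (layer height, position of the dividing line):
    the height is the share of the sample in the category, the position
    is the response fraction measured from the left-hand side of the box.
    """
    total = sum(r + nr for r, nr in counts.values())
    return {cat: ((r + nr) / total, r / (r + nr))
            for cat, (r, nr) in counts.items()}

# Hypothetical counts, for illustration only.
counts = {
    "not yet married": (120, 80),
    "married": (400, 100),
    "divorced": (30, 20),
    "widowhood": (60, 40),
}
layers = box_plot_layers(counts)
```

If the dividing positions are roughly equal across categories the lines form
a straight line, which indicates no relationship between response behaviour
and the auxiliary variable.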
3.2.2 The Windmill-Plot 


The windmill-plot is a graphical representation of the results of
correspondence analysis. Correspondence analysis is a technique for the
analysis of associations in two-way tables. (See e.g. Benzecri (1976).) A
geometrical representation of the rows (the categories of the vertically
tabulated variable) and the columns (the categories of the horizontally
tabulated variable) is constructed. This geometrical representation
contains all the information concerning the associations in the table. By
means of a scaling procedure rows and columns are assigned values in such a
way that the correlation coefficient, computed by using these values, is
maximized. To each cell in the table there correspond two scale values: a
row-value and a column-value. When these values are conceived as
coordinates, a plot of the table can be constructed. In this plot all points
form an unequally spaced grid. Such a plot may not be easy to interpret.
To simplify interpretation regression lines are plotted instead of the
points themselves. Due to the special properties of the scale values the
regression line to explain the y-values from the x-values in the plot has the
simple form

    $y = \rho_1 x$                                                  (1)

and the regression line to explain the x-values from the y-values has the
form

    $x = \rho_1 y$                                                  (2)

where $\rho_1$ is the maximized correlation coefficient. By plotting both
regression lines the result is the windmill-plot, see fig. 4.

FIGURE 4. THE WINDMILL-PLOT


A number of aspects may be worth noting: 


(1) The origin represents both marginal distributions of the table.

(2) Scale values close to the origin point at categories which
resemble the marginal distribution and thus have a regular
behaviour. Far out scale values indicate differently behaving
categories.

(3) The relationship between the two variables is strong if the two
regression lines are near the 45°-line.

(4) Projection of a differently behaving category of one variable
via the regression line on the axis of the other variable
provides a clue about the dependencies of the categories of
the variables.


The plot as described above cannot account for all the information in the
table. It explains as much as is possible in a two-dimensional plot.
Conditionally on the first plot a second plot can be constructed, which
accounts for as much as is possible of the information not yet explained.
If necessary even more plots can be constructed, but preferably one plot
is sufficient to explain the major part of the associations.


A total of s such plots can be made, in which s is one less than the
minimum of the number of rows and the number of columns. Let
$\rho_1, \rho_2, \ldots, \rho_s$ be the maximized correlation coefficients.
Since

    $\rho_1^2 + \rho_2^2 + \cdots + \rho_s^2 = X^2 / N$,            (3)

where $X^2$ is the chi-square test statistic for the table and N the grand
total,

    $t_i = N \rho_i^2 / X^2$                                        (4)

is a measure of the amount of information explained by the i-th plot
(i = 1, 2, ..., s).
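
The quantities in (3) and (4) can be computed for the first plot without
special software. The sketch below uses the standard result that the
maximized correlations are the singular values of the matrix of standardized
residuals, and finds the largest one by power iteration. The function name
and the example table are hypothetical, and the power iteration is only one
of several ways to obtain $\rho_1$.

```python
import math

def first_windmill_plot(table, iterations=200):
    """Return (rho_1, t_1, chi_square) for a two-way table of counts.

    Assumes all row and column totals are positive.
    """
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    # Standardized residuals; their squares add up to X^2 / N, as in (3).
    s = [[(table[i][j] - rows[i] * cols[j] / total)
          / math.sqrt(rows[i] * cols[j])
          for j in range(len(cols))] for i in range(len(rows))]
    inertia = sum(v * v for row in s for v in row)
    if inertia == 0.0:
        return 0.0, 0.0, 0.0  # exact independence: nothing to explain
    # Power iteration on S'S; its largest eigenvalue equals rho_1 squared.
    v = [float(j + 1) for j in range(len(cols))]
    eigenvalue = 0.0
    for _ in range(iterations):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        u = [sum(a * b for a, b in zip(row, v)) for row in s]
        w = [sum(s[i][j] * u[i] for i in range(len(rows)))
             for j in range(len(cols))]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        if eigenvalue == 0.0:
            break
        v = w
    return math.sqrt(eigenvalue), eigenvalue / inertia, total * inertia

# Hypothetical 2 x 3 table: s = 1, so the first plot explains everything.
rho_1, t_1, chi_square = first_windmill_plot([[30, 20, 10], [10, 20, 30]])
```

For a table with only two rows there is a single plot (s = 1), so $t_1 = 1$
and $\rho_1^2 = X^2 / N$, which gives a quick check of the computation.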


Fig. 5 contains the first windmill-plot for the variables age (six cate- 
gories) and type of nonresponse (five categories) of the Housing Demand 


Survey 1977/1978 as far as it concerns Amsterdam. 


FIGURE 5. WINDMILL-PLOT OF AGE BY TYPE OF NONRESPONSE IN
AMSTERDAM IN THE HOUSING DEMAND SURVEY 1977/78

[Figure: windmill-plot; surviving point labels: not at home, uninhabited,
70+.]


The plot contains about 88% of the information about associations in the
table ($t_1 = 0.88$). The main reasons for nonresponse of the old people are
refusal and illness. In case of young people the nonresponse is the
result of the impossibility of making contact: uninhabited, not at home and
moved. More about the application of correspondence analysis can be found
in Bethlehem & Kersten (1980).

3.3 Other selection methods


There are many other, mainly nongraphical, methods to determine the
association between auxiliary variables and the response behaviour. Much
about association in contingency tables can e.g. be found in Bishop,
Fienberg & Holland (1975).


A popular method for the selection of the most important auxiliary vari- 
ables is AID (Automatic Interaction Detection), described by Morgan & 
Sonquist (1963). In a stepwise process those auxiliary variables are deter- 
mined which can explain as much as possible of the variance of the binary 
response variable. There are disadvantages which make reliable application 
of this method doubtful. As the selection process proceeds in a stepwise 
fashion there is no guarantee that the optimal solution will be found. 
Because there is no stopping rule based on a statistical model, the result
is in this sense rather arbitrary. Further research in this field is
necessary (see e.g. Kass (1980)).


4. REDUCTION OF NONRESPONSE BIAS BY SUBGROUP WEIGHTING


When a relationship is found or suspected between the variable under
investigation (Y) and the response behaviour (R) measures have to be taken
in order to reduce the nonresponse bias. In this section a number of
adjustment procedures are discussed which are based on subgroup weighting.
Attention is focussed on estimating the population mean of Y.


It can be shown that the bias, introduced by only using response values,
is proportional to the covariance between Y and R. If it were possible
to divide the population in a number of subgroups in each of which
the covariance is negligible, then (nearly unbiased) estimates of the
subgroup means can be combined into a (nearly unbiased) estimate of the
population mean.
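
The proportionality between bias and covariance can be checked numerically.
The sketch below assumes the fixed response model applied to the whole
population, where R is a 0/1 indicator; in that case the mean of the
respondents differs from the population mean by exactly Cov(Y, R) divided by
the mean of R. The function name and the data are hypothetical.

```python
def respondent_bias(y, r):
    """Return (bias of the respondent mean, Cov(Y, R) / mean(R)).

    y holds the population values, r the 0/1 response indicators.
    The covariance is the population covariance (divisor N).
    """
    n = len(y)
    mean_y = sum(y) / n
    mean_r = sum(r) / n
    respondent_mean = sum(yi for yi, ri in zip(y, r) if ri) / sum(r)
    cov = sum((yi - mean_y) * (ri - mean_r) for yi, ri in zip(y, r)) / n
    return respondent_mean - mean_y, cov / mean_r

# Hypothetical population in which large values tend not to respond.
y = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
r = [1, 1, 1, 0, 1, 0]
bias, scaled_cov = respondent_bias(y, r)
```

Here both quantities are negative and equal: nonrespondents have larger
values on average, so the respondent mean underestimates the population mean.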


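Let the finite population consist of N elements $U_1, U_2, \ldots, U_N$ with
Y-values $Y_1, Y_2, \ldots, Y_N$. From this population a simple random sample
$\underline{u}_1, \underline{u}_2, \ldots, \underline{u}_n$
(stochastic variables are underlined) of size n is selected without
replacement. The corresponding y-values are
$\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_n$ and the
response behaviour is indicated by
$\underline{r}_1, \underline{r}_2, \ldots, \underline{r}_n$
($\underline{r}_i = 1$ indicating response and $\underline{r}_i = 0$
nonresponse). In fact $\underline{y}_i$ can only be observed for those sample
elements $\underline{u}_i$ for which $\underline{r}_i = 1$. The
$\underline{m}$ responding elements are denoted by
$\underline{u}_1^*, \underline{u}_2^*, \ldots, \underline{u}_m^*$
($\underline{m} = \underline{r}_1 + \underline{r}_2 + \cdots + \underline{r}_n$),
with y-values $\underline{y}_1^*, \underline{y}_2^*, \ldots, \underline{y}_m^*$.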


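Let X be an auxiliary variable inducing a division of the population in
H subgroups with sizes $N_1, N_2, \ldots, N_H$. In subgroup weighting first
of all in each subgroup h an estimator for the subgroup mean is computed:

    $\underline{\bar{y}}_h = \frac{1}{\underline{m}_h} \sum_{i=1}^{\underline{m}_h} \underline{y}_{hi}^*$    (h = 1, 2, ..., H)    (5)

where $\underline{y}_{h1}^*, \underline{y}_{h2}^*, \ldots, \underline{y}_{h m_h}^*$
are the values of the $\underline{m}_h$ responding elements
in subgroup h. The subgroup estimators
$\underline{\bar{y}}_1, \underline{\bar{y}}_2, \ldots, \underline{\bar{y}}_H$
are combined into a population estimator

    $\underline{\bar{y}} = \sum_{h=1}^{H} w_h \underline{\bar{y}}_h$                    (6)
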
The type of estimator is determined by the available amount of information
about the weights $w_1, w_2, \ldots, w_H$.

If the sizes $N_1, N_2, \ldots, N_H$ of the subgroups are known the situation
is equivalent to poststratification. (See e.g. Holt & Smith (1979).) The
weights are not random but fixed quantities:

    $w_h = N_h / N$    (h = 1, 2, ..., H)                           (7)

If these sizes are not known the weights can be estimated by

    $\underline{w}_h = \underline{n}_h / n$    (h = 1, 2, ..., H)   (8)

where $\underline{n}_h$ is the number of sample elements in subgroup h
($n = \underline{n}_1 + \underline{n}_2 + \cdots + \underline{n}_H$).
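
The estimators built from (5) to (8) can be sketched in a few lines. The
function names and the data below are hypothetical; the sketch assumes every
subgroup contains at least one respondent.

```python
def subgroup_weighted_mean(responses, weights):
    """Estimator (6): weighted sum of the respondent means (5).

    responses maps subgroup h to the y-values of its respondents,
    weights maps subgroup h to w_h (the weights should sum to 1).
    """
    estimate = 0.0
    for h, values in responses.items():
        if not values:
            raise ValueError("subgroup %r has no respondents" % (h,))
        estimate += weights[h] * sum(values) / len(values)
    return estimate

def known_size_weights(sizes):
    """Weights (7): w_h = N_h / N, from known subgroup sizes."""
    total = sum(sizes.values())
    return {h: size / total for h, size in sizes.items()}

def sample_count_weights(sample_counts):
    """Weights (8): w_h = n_h / n, from sample counts per subgroup."""
    total = sum(sample_counts.values())
    return {h: count / total for h, count in sample_counts.items()}

# Hypothetical data: respondent values per subgroup.
responses = {"a": [1.0, 3.0], "b": [10.0]}
est_known = subgroup_weighted_mean(
    responses, known_size_weights({"a": 60, "b": 40}))
est_sample = subgroup_weighted_mean(
    responses, sample_count_weights({"a": 4, "b": 6}))
```

The two estimates differ only through the weights, which mirrors the remark
that the type of estimator is determined by what is known about the weights.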


In an intermediate situation where two auxiliary variables $X_1$ and $X_2$
are used and only the marginal totals of the two variables are known, a
raking procedure can be applied to estimate the weights (see e.g. Chapman
(1976)). Suppose $X_1$ induces G groups and $X_2$ induces H groups. Crossing
$X_1$ and $X_2$ results in a subdivision into G x H groups. If only the
marginal totals $N_{g\cdot}$ (g = 1, 2, ..., G) of $X_1$ and $N_{\cdot h}$
(h = 1, 2, ..., H) of $X_2$ are known then by using the sample information
good estimates $\hat{N}_{gh}$ of $N_{gh}$ can be computed. The
weights are then equal to

    $\underline{w}_{gh} = \hat{N}_{gh} / N$    (g = 1, 2, ..., G; h = 1, 2, ..., H)    (9)

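
A raking procedure of this kind is commonly implemented as iterative
proportional fitting: the sample counts are scaled alternately to the known
row totals and the known column totals until both sets of margins are
reproduced. The sketch below is one such implementation under the assumption
that all sample cells are positive; the function names and the numbers are
hypothetical.

```python
def rake(sample_counts, row_totals, col_totals, iterations=100):
    """Estimate the cell sizes N_gh from sample counts n_gh by raking.

    row_totals and col_totals are the known population margins; they
    must add up to the same N. All sample cells are assumed positive.
    """
    est = [[float(v) for v in row] for row in sample_counts]
    for _ in range(iterations):
        # Scale each row to its known total.
        for g, row in enumerate(est):
            factor = row_totals[g] / sum(row)
            est[g] = [v * factor for v in row]
        # Scale each column to its known total.
        for h in range(len(col_totals)):
            factor = col_totals[h] / sum(row[h] for row in est)
            for row in est:
                row[h] *= factor
    return est

def raking_weights(est):
    """Weights (9): estimated cell size divided by the population size."""
    total = sum(sum(row) for row in est)
    return [[v / total for v in row] for row in est]

# Hypothetical 2 x 2 example with N = 1000.
cells = rake([[20, 30], [10, 40]], [600, 400], [350, 650])
weights = raking_weights(cells)
```

The resulting weights can be used in estimator (6) with one subgroup for
each cell of the cross-classification.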


All three estimators have, when used in the same grouping situation, the
same bias, but the greater the amount of available information on the
subgroup sizes the smaller the variance of the estimate. Subgroup weighting
has two advantages: reduction of the variance of the estimate and
reduction of the response bias. The most extreme possibilities are
illustrated in fig. 6. If two variables are connected in the figure it means
that they have a strong correlation.

FIG. 6. VARIANCE AND BIAS OF ESTIMATORS BEFORE AND AFTER SUBGROUP WEIGHTING

[Figure: one panel per case, each showing the parameter to be estimated and
the estimator before subgroup weighting (dashed) and after subgroup
weighting. Y = variable under investigation, R = response variable,
X = auxiliary variable.]

A number of conclusions can be drawn:


(1) If nonresponse bias exists subgroup weighting is significant when
X and R are correlated (case 2). Both bias and variance are
reduced.


(2) If no nonresponse bias exists a correlation between X and R has no
effect (case 4). Only correlation between X and Y reduces the
variance (case 5).


Because the data on the nonrespondents are missing, it is impossible to
use the remaining data to find an auxiliary variable X which is highly
correlated with Y. It is, however, possible to use these data to look for
auxiliary variables which are highly correlated with the response
variable R. If such a variable has been found, application of it in
subgroup weighting will reduce the nonresponse bias (if it exists), but
not always the variance.


5. OTHER ADJUSTMENT METHODS


Several other adjustment methods appear in the literature. Several of 
them will be discussed in this section. Some of them need further 


research to establish their merits. 


5.1 No adjustment 


In some situations no adjustment is necessary. If it appears that no
relationship exists between the variable under investigation and the
response behaviour the response can be considered as a random sample
from the population. Also if statements are restricted to the population
of potential respondents no correction is necessary. In all other
situations no adjustment is only justified if the category "nonresponse"
is included in all tables in publications.


5.2. Imputation 


Imputation procedures solve the problem of missing observations due to
nonresponse by substitution of values in the records of the nonrespondents.
In "hot deck" imputation data are taken from respondents of the current
survey, while in "cold deck" imputation data are taken from a previous
survey. If the response structures of the previous and current survey
resemble each other the results of cold deck imputation and hot deck
imputation will roughly be the same. Imputation can be carried out in
several ways. Some of them are:


(1) imputation of a random respondent
(2) imputation of the mean respondent
(3) imputation of a random respondent within the same subgroup
(4) imputation of the mean respondent within the same subgroup
(5) imputation of a value obtained by fitting a model
(6) imputation of upper or lower bounds


Procedures (1) and (2) do not reduce the bias. Procedures (3) and (4)
resemble subgroup weighting. The effect of procedure (5) depends strongly
on the fit of the model and the reasonableness of the model assumptions.
Procedure (6) gives insight into how bad things can be if no adjustment
takes place.
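
The statements about procedures (2) and (4) can be illustrated with a small
sketch. Imputing the overall respondent mean leaves the estimated mean
unchanged, while imputing the subgroup mean reproduces subgroup weighting
with weights equal to the sample shares of the subgroups. The function names
and the records are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def impute_overall_mean(records):
    """Procedure (2): replace each missing y by the overall respondent mean.

    records is a list of (subgroup, y) pairs; y is None for nonrespondents.
    """
    fill = mean([y for _, y in records if y is not None])
    return [(h, fill if y is None else y) for h, y in records]

def impute_subgroup_mean(records):
    """Procedure (4): replace each missing y by the respondent mean of the
    same subgroup. Every subgroup needs at least one respondent."""
    observed = {}
    for h, y in records:
        if y is not None:
            observed.setdefault(h, []).append(y)
    return [(h, mean(observed[h]) if y is None else y) for h, y in records]

# Hypothetical sample: subgroup "a" responds less often than subgroup "b".
records = [("a", 1.0), ("a", 3.0), ("a", None), ("a", None),
           ("b", 10.0), ("b", 12.0), ("b", 14.0), ("b", None)]

respondent_mean = mean([y for _, y in records if y is not None])
mean_after_2 = mean([y for _, y in impute_overall_mean(records)])
mean_after_4 = mean([y for _, y in impute_subgroup_mean(records)])
```

In this example the respondent mean and the mean after procedure (2) are
both 8, while procedure (4) gives 7, which equals 0.5 x 2 + 0.5 x 12.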


5.3. Adjustment for not-at-homes 


The well-known method of Politz & Simmons (1949) tries to adjust for
not-at-home bias by estimating the probability of finding a person at home.
This is performed by asking respondents e.g. how often they were at
home at the time of the interview during the previous days. The
at-home-probability, constructed in this way, can be used as a stratification
variable. It is also worth trying to find a model which explains the
relationship between the variable under investigation and the
at-home-probability. Extrapolation of this model to the group of not-at-homes
may provide more information about this group.
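
One common way of using the at-home-probability is to weight each respondent
by its inverse. The sketch below follows the usual textbook form of the
method, in which a respondent who was at home on k of the previous d days
gets the estimated probability (k + 1) / (d + 1). This weighting scheme, the
choice d = 5, the function name and the data are assumptions made for
illustration; the text above describes the use as a stratification variable.

```python
def politz_simmons_mean(respondents, days=5):
    """Weighted mean with weights inversely proportional to the estimated
    at-home-probability.

    respondents is a list of (y, k) pairs, where k is the number of the
    previous `days` days on which the respondent was at home at the time
    of the interview (0 <= k <= days).
    """
    numerator = 0.0
    denominator = 0.0
    for y, k in respondents:
        p_home = (k + 1) / (days + 1)  # the interview day itself counts
        numerator += y / p_home
        denominator += 1 / p_home
    return numerator / denominator

# Hypothetical data: the second respondent is rarely at home.
estimate = politz_simmons_mean([(10.0, 5), (20.0, 2)])
```

The respondent who is at home half of the time represents two such persons,
so the estimate moves from the unweighted mean 15 towards 20.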

5.4. Adjustment for refusers


It is possible to measure the willingness of people to co-operate in the
survey (see Van Tulder (1977)). Using this information a procedure
analogous to adjustment for not-at-homes can be carried out. Furthermore
the willingness to co-operate is a measure for the survey climate. The
construction of a scale to obtain this information will probably be
somewhat more difficult than in the case of not-at-home adjustment.


5.5 Double sampling 


In order to get more information about nonrespondents Hansen & Hurwitz 
(1946) propose selecting a sample from the nonrespondents. Specially 
trained interviewers try as yet to obtain (part of) the missing informa- 
tion. Time and money constraints often prevent application of double 


sampling. 
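
The estimator usually associated with this approach combines the mean of the
original respondents with the mean of the followed-up subsample, the latter
standing in for all nonrespondents. The sketch below assumes that every
element of the subsample eventually responds; the function name and the
numbers are hypothetical.

```python
def double_sampling_mean(respondent_values, n_nonrespondents,
                         followup_values):
    """Combine first-phase respondents with a follow-up subsample.

    respondent_values: y-values of the m original respondents
    n_nonrespondents:  number of nonrespondents in the original sample
    followup_values:   y-values obtained from the subsample of
                       nonrespondents (assumed to respond completely)
    """
    m = len(respondent_values)
    n = m + n_nonrespondents
    mean_respondents = sum(respondent_values) / m
    mean_followup = sum(followup_values) / len(followup_values)
    return (m * mean_respondents + n_nonrespondents * mean_followup) / n

# Hypothetical sample of 5: three respondents, two nonrespondents of whom
# one is followed up.
estimate = double_sampling_mean([10.0, 10.0, 10.0], 2, [20.0])
```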


5.6. The principal question 


If the method of Hansen & Hurwitz is too expensive the principal question
procedure may offer a substitute. In many surveys there is one
important basic question around which the survey has been constructed.
If during the field work problems are met with completing the whole
questionnaire, the interviewer may try to get an answer on only the
principal question. This may even be tried afterwards by letter or by
telephone. This technique will shortly be tried out in one of the
surveys of the CBS.


6. CONCLUSIONS


In view of the rise in nonresponse rates during the past years it is 
important to carry out thorough research on the impact of nonresponse on 


the quality of the survey. 


Quite a few adjustment procedures appear in the literature, which all aim
at reduction of the nonresponse bias. A comparative study of these
procedures has to provide decisive answers about their merits.

The large differences which exist with regard to objective, design and
execution of surveys prevent correct interpretation of differences in
nonresponse figures. It is therefore necessary to create a theoretical
framework which allows proper comparison.


Of course reduction of nonresponse during the field work will remain 


an important topic. 


SURVEY METHODOLOGY 1981, VOL. 7 NO.


ON THE VARIANCES OF ASYMPTOTICALLY 
NORMAL ESTIMATORS FROM COMPLEX SURVEYS 


David A. Binder


The problem of specifying and estimating the variance of 
estimated parameters based on complex sample designs from 
finite populations is considered. The results of this 
paper are particularly useful when the parameter estimators
cannot be defined explicitly as a function of other
statistics from the sample. It is shown how these results 
can be applied to linear regression, logistic regression 
and loglinear contingency table models. 


1. INTRODUCTION 


In recent years, there has been an increasing demand for using survey 
data to estimate the parameters of traditional models such as regres- 
sion parameters, discriminant functions, logit and probit parameters 
and others. However, for many such surveys, the primary objective
of the survey is the estimation of population or sub-population means,
totals, trends and so on. For this reason and because of operational 
considerations, the survey design is often not a simple random sample, 
but is more typically stratified and often multi-stage with possibly 


unequal probabilities at certain stages of sampling. 


Because of this, there has been much discussion (see, for example,
Särndal, 1978) on whether the sampling weights should be used in making
inferences about these model parameters. The answer seems to depend on
whether a superpopulation model is appropriate for all population units.
If this is the case, the inference on the superpopulation parameters is
often the primary concern. This leads to model-based inference, where,
for a given sample, the inferences do not depend on the sampling weights.


1 D.A. Binder, Institutional and Agriculture Survey Methods Division,
Statistics Canada.
The question that comes to mind is: If the superpopulation model is
not appropriate, what parameters are we estimating? It must be recog-
nized that for many studies, particularly in the social sciences, the
model (e.g. linear regression) is only a convenient approximation of
the real world and the parameters of that model (e.g. correlations and
partial correlations) are often used to understand the approximate
interdependencies of the variables rather than having a particular
scientific interpretation. Therefore, the parameters we are estimat-
ing do not necessarily refer to a true superpopulation model, but are
of a more descriptive nature.


In this paper, we adopt the view that we are interested in making
inferences about these "descriptive" parameters of the population. For
example, suppose $X$ and $Y$ are $N \times p$ and $N \times 1$ matrices
respectively, where each row of $X$ and $Y$ corresponds to a different
individual of the population. We are interested in the descriptive
parameter, $B$, a $p \times 1$ vector satisfying the equations:

$$X^T X B = X^T Y. \tag{1.1}$$

This view of descriptive parameters is the same as that taken by
Frankel (1971) and Kish and Frankel (1974).


The usual estimation of such parameters normally takes into account
the sampling weights. If we denote by $\pi_i$ the inclusion probability
of the i-th sampled unit and let $\Pi = \mathrm{diag}(\pi_1, \ldots, \pi_n)$,
then the weighted parameter estimate $\hat{B}$ for $B$ satisfies:

$$x^T \Pi^{-1} x \hat{B} = x^T \Pi^{-1} y, \tag{1.2}$$

where $x$ and $y$ are $n \times p$ and $n \times 1$ matrices respectively,
the rows of which correspond to the sampled rows of $X$ and $Y$.
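
As an illustration only, the weighted estimate in (1.2) can be computed
directly from the sampled rows and their inclusion probabilities. The
sketch below assumes NumPy; the function name `weighted_regression` and
the small data set are hypothetical and are not part of the paper.

```python
import numpy as np

def weighted_regression(x, y, pi):
    """Solve x' Pi^{-1} x B = x' Pi^{-1} y for B, as in (1.2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)   # sampling weights 1 / pi_i
    xtwx = x.T @ (w[:, None] * x)           # x' Pi^{-1} x
    xtwy = x.T @ (w * y)                    # x' Pi^{-1} y
    return np.linalg.solve(xtwx, xtwy)

# Hypothetical sample of n = 4 units: an intercept column and one covariate.
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])          # lies exactly on y = 1 + 2 * covariate
pi = np.array([0.1, 0.2, 0.2, 0.5])         # unequal inclusion probabilities
b_hat = weighted_regression(x, y, pi)
```

Because the illustrative data lie exactly on a line, the weights do not
change the solution here; with real survey data they generally do.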


Suppose, now, an estimator of a population parameter can be expressed
as:

$$\hat{\theta} = g(z_1, \ldots, z_k), \tag{1.3}$$

where $E(z_i) = Z_i$. Here, $\hat{\theta}$ is an estimator of
$g(Z_1, \ldots, Z_k)$. Following Tepping (1968) and Woodruff (1971), a
Taylor series expansion for $\hat{\theta}$ yields:

$$V(\hat{\theta}) \approx V\left[\sum_{i=1}^{k} \frac{\partial g}{\partial Z_i}(z_i - Z_i)\right]. \tag{1.4}$$


These formulae are exemplified for estimation of regression coeffi-
cients (1.1) by Tepping (1968). However, the expressions resulting
from (1.4) for the variances of the regression coefficients are some-
what complicated compared to those derived by Fuller (1975).


In this paper we consider parameters which are not defined through an
explicit equation such as (1.3), but instead are defined implicitly as
$U(Z, \theta) = 0$. A simple example showing the distinction would be the
ratio parameter:

$$R = \frac{\sum Y}{\sum X},$$

which could also be defined implicitly as:

$$\sum Y - R \sum X = 0.$$


When we deal with some models such as indirect loglinear models or
logistic regression models, the parameters can be defined only through
implicit relationships. The extension of Tepping's (1968) results for
this case is fairly straightforward, but does not appear in its general
form at present in the literature. There are, however, specific
examples of its application; see, for example, Fuller (1975) and
Freeman and Koch (1976).

In Section 2 we give the general framework and the main results of
the paper. A number of models are exemplified in Section 3.
2. GENERAL FRAMEWORK


2.1 Framework 


The population units are labelled 1, ..., N. Associated with the i-th
unit we have a q-dimensional data vector $X_i$. We have a parameter space
$\Theta \subset R^p$. The parameter $\theta_0 = (\theta_{01}, \ldots, \theta_{0p})$
is defined by the p equations:

$$U_i(X, \theta_0) = \sum_{k=1}^{N} u_i(X_k, \theta_0) - v_i(\theta_0) = 0 \tag{2.1}$$

for i=1, ..., p. We assume that equations (2.1) define $\theta_0$ uniquely
in $\Theta$. We also assume that $\partial u_i(X, \theta)/\partial\theta$ and
$\partial v_i(\theta)/\partial\theta$ exist in a neighbourhood of $\theta_0$.
(A simple example of (2.1) is where $\theta_0$ is a population total, and
we have $U(X, \theta_0) = \sum_{k=1}^{N} X_k - \theta_0$. Here,
$u(X_k, \theta_0) = X_k$ and $v(\theta_0) = \theta_0$.)

We select a sample of the units, according to some probability distri-
bution defined on the set of all non-empty subsets of {1, ..., N}. We
denote by $x_1, \ldots, x_n$ the selected values of $X_1, \ldots, X_N$. We
assume that for any $\theta \in \Theta$, we can construct a consistent,
asymptotically normal estimator of $U_i(X, \theta)$. We denote this
estimator by $\hat{U}_i(x, \theta)$. For example, for many without
replacement sampling schemes,

$$\hat{U}_i(x, \theta) = \sum_{k=1}^{n} u_i(x_k, \theta)/\pi_k - v_i(\theta) \tag{2.2}$$

will be a consistent asymptotically normal estimator, where $\pi_k$ is the
probability of inclusion for the k-th unit.


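We let $\sigma_{ij}(X, \theta) = \mathrm{Cov}[\hat{U}_i(x, \theta), \hat{U}_j(x, \theta)]$.
For example, for estimator (2.2), we have

$$\sigma_{ij}(X, \theta) = \sum_{k=1}^{N} \sum_{l=1}^{N} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_k \pi_l}\, u_i(X_k, \theta)\, u_j(X_l, \theta), \tag{2.3}$$

where $\pi_{kl}$ is the probability that the k-th and l-th units are both
in the sample (with $\pi_{kk} = \pi_k$).

We let $\Sigma(X, \theta)$ be the $p \times p$ matrix with entries
$\sigma_{ij}(X, \theta)$, and $\hat{\Sigma}(x, \theta)$ be a consistent
estimator for $\Sigma$. Now, for any given $\theta$,

$$U_i(X, \theta) + v_i(\theta) = \sum_{k=1}^{N} u_i(X_k, \theta),$$

so that estimators $\hat{U}_i(x, \theta)$ and $\hat{\Sigma}(x, \theta)$ can be
specified for any design in which we can derive consistent asymptotically
normal estimators of population totals and consistent estimators for the
variances of the estimators of the totals.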
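The Horvitz-Thompson estimator for (2.3) is:

$$\hat{\sigma}_{ij}(x, \theta) = \sum_{k=1}^{n} \sum_{l=1}^{n} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}\, \pi_k \pi_l}\, u_i(x_k, \theta)\, u_j(x_l, \theta). \tag{2.4}$$

In the case of fixed sample size, the Yates-Grundy estimator of (2.3)
is:

$$\hat{\sigma}_{ij}(x, \theta) = \sum_{k<l} \frac{\pi_k \pi_l - \pi_{kl}}{\pi_{kl}} \left[\frac{u_i(x_k, \theta)}{\pi_k} - \frac{u_i(x_l, \theta)}{\pi_l}\right] \left[\frac{u_j(x_k, \theta)}{\pi_k} - \frac{u_j(x_l, \theta)}{\pi_l}\right]. \tag{2.5}$$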

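Letting $U(X, \theta)$ and $\hat{U}(x, \theta)$ be the p-dimensional vectors
with components $U_i(X, \theta)$ and $\hat{U}_i(x, \theta)$ respectively, we
define

$$J(X, \theta) = \partial U(X, \theta)/\partial\theta \tag{2.6}$$
$$\hat{J}(x, \theta) = \partial \hat{U}(x, \theta)/\partial\theta, \tag{2.7}$$

where $J$ and $\hat{J}$ are $p \times p$ partial derivative matrices. Assume
that the matrices are continuous functions of $\theta$ and that the partial
derivatives with respect to $\theta$ exist in a neighbourhood of $\theta_0$.
Also assume $\hat{J}(x, \theta)$ is a consistent estimator of $J(X, \theta)$.

Our estimator for $\theta_0$ is given by $\hat{\theta}$, the solution to:

$$\hat{U}_i(x, \hat{\theta}) = 0 \quad \text{for } i = 1, \ldots, p. \tag{2.8}$$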
We assume the sample size is sufficiently large so that the solution
to (2.8) is unique in $\Theta$. We show in the next section that the covar-
iance matrix of $\hat{\theta}$ can be consistently estimated by:

$$[\hat{J}^{-1}(x, \hat{\theta})]\, \hat{\Sigma}(x, \hat{\theta})\, [\hat{J}^{-1}(x, \hat{\theta})]^T.$$
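
Once the derivative matrix and the covariance matrix of the estimating
functions have been estimated, the expression above is a single matrix
product. The following is a minimal sketch assuming NumPy; the function
name `sandwich_covariance` and the numerical inputs are hypothetical.

```python
import numpy as np

def sandwich_covariance(j_hat, sigma_hat):
    """Return J^{-1} Sigma (J^{-1})^T for p x p arrays j_hat and sigma_hat."""
    j_inv = np.linalg.inv(np.asarray(j_hat, dtype=float))
    return j_inv @ np.asarray(sigma_hat, dtype=float) @ j_inv.T

# Population total: U = sum(X) - theta, so J = -1 and the covariance of the
# estimated total equals the covariance of the estimating function.
v_total = sandwich_covariance([[-1.0]], [[4.0]])

# A diagonal two-parameter illustration with made-up numbers.
v_diag = sandwich_covariance(np.diag([2.0, 4.0]), np.eye(2))
```

The first call reproduces the population total example of Section 2.1,
where the derivative of the estimating function is -1.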


2.2 Asymptotic Theory 


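Following the asymptotic arguments of Madow (1948) and Hájek (1960),
we consider a sequence of populations indexed by t, with sizes $N^{(t)}$
and data $X^{(t)}$. We assume $N^{(t)} \to \infty$ as $t \to \infty$. For
population t, we select a sample of size $n^{(t)}$ and observe data
$x^{(t)}$. We let $\nu^{(t)}$ = [...] and assume

$$\lim_{t \to \infty} \nu^{(t)} = 0,$$

[...]

For any $\theta$ in a neighbourhood of $\theta_0^{(t)}$,

$$[\nu^{(t)}]^{-1/2}\, [\hat{U}(x^{(t)}, \theta) - U(X^{(t)}, \theta)] / N^{(t)}$$

is asymptotically $N[0, S(\theta)]$, where we assume

$$S(\theta) = \lim_{t \to \infty} [\nu^{(t)}]^{-1}\, \Sigma(X^{(t)}, \theta) / [N^{(t)}]^2$$

exists. We assume $K(\theta) = \lim_{t \to \infty} J(X^{(t)}, \theta)/N^{(t)}$
exists and also

$$\mathrm{plim}\; \hat{J}(x^{(t)}, \theta)/N^{(t)} = K(\theta).$$

Also, we assume

$$\lim\, [\mathrm{rank}\{J(X^{(t)}, \theta)\}] = \mathrm{plim}\, [\mathrm{rank}\{\hat{J}(x^{(t)}, \theta)\}] = p.$$

We define $\hat{\theta}^{(t)}$ to satisfy

$$\hat{U}(x^{(t)}, \hat{\theta}^{(t)}) = 0.$$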


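By a Taylor series expansion, we obtain

$$\hat{U}(x^{(t)}, \theta_0^{(t)}) \approx -\hat{J}(x^{(t)}, \hat{\theta}^{(t)})\, (\hat{\theta}^{(t)} - \theta_0^{(t)}). \tag{2.9}$$

Since the left hand side of (2.9) is asymptotically normal, we have that

$$[\nu^{(t)}]^{-1/2}\, (\hat{\theta}^{(t)} - \theta_0^{(t)})$$

is asymptotically $N[0, G(\theta_0)]$, where
$S(\theta_0) = K(\theta_0)\, G(\theta_0)\, [K(\theta_0)]^T$ and
$\theta_0 = \lim_{t \to \infty} \theta_0^{(t)}$.

Therefore,

$$G(\theta_0) = [K(\theta_0)]^{-1}\, S(\theta_0)\, \{[K(\theta_0)]^{-1}\}^T \tag{2.10}$$

and a consistent estimator for $G(\theta_0)$ is:

$$[\nu^{(t)}]^{-1}\, \hat{J}^{-1}(x, \hat{\theta})\, \hat{\Sigma}(x, \hat{\theta})\, [\hat{J}^{-1}(x, \hat{\theta})]^T. \tag{2.11}$$

Hence, when the functional form of $\hat{U}(x, \theta)$ and
$\hat{\Sigma}(x, \theta)$ is specified, we need only derive the matrix
$J(X, \theta_0)$ and its estimator $\hat{J}(x, \hat{\theta})$ to use
these results.
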
3. EXAMPLES 


3.1 Introduction 


In this section we consider in detail the implication of the general
formulation given in Section 2 with respect to estimating the vari-
ances of certain population parameter estimators. In particular, we
discuss ratios, regression coefficients and loglinear models for cat-
egorical data. Other models, such as probit models, could be analyzed
analogously.

In general, we use the following notation. If $W_1, \ldots, W_N$ are
population values, with $W = \sum W_k$, then on selecting a sample
$w_1, \ldots, w_n$, we have an unbiased estimator of $W$ given by $\hat{W}$.
We let $V(\hat{W})$ represent the covariance matrix for $\hat{W}$ and
$\hat{V}(\hat{W})$ a consistent estimator of $V(\hat{W})$. The
particular form of this estimator will depend on the sample design;
for example, multi-stage stratified, pps with replacement, etc.


3.2 Ratios


Suppose we are interested in $R = \sum X_{k2} / \sum X_{k1}$. We define

$$U(X, R) = \sum X_{k2} - R \sum X_{k1}.$$

Therefore, for without replacement sampling, we have:

$$\hat{U}(x, R) = \sum_{k=1}^{n} (x_{k2} - R\, x_{k1})/\pi_k.$$

Setting $\hat{U}(x, \hat{R}) = 0$, we obtain

$$\hat{R} = \hat{X}_2 / \hat{X}_1. \tag{3.1}$$

We define $W_k = X_{k2} - R X_{k1}$. Since $J(X, R) = -\sum X_{k1}$, we have
that $V(\hat{R})$ is approximately $V(\hat{W})/(\sum X_{k1})^2$. This is
estimated by $\hat{V}(\hat{W})/\hat{X}_1^2$. In the case of stratified
sampling, this yields the same result as in Woodruff (1971).
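
A short sketch of this linearization follows, assuming NumPy. The variance
of the estimated total of the linearized variable is computed here with
the usual with-replacement approximation, which is an assumption made only
to keep the example self-contained; the function name
`ratio_with_variance` and the data are hypothetical.

```python
import numpy as np

def ratio_with_variance(x_num, x_den, pi):
    """Estimate a ratio of totals and its linearized variance.

    The variance of the estimated total of w_k = x_num_k - R_hat * x_den_k
    uses a with-replacement approximation (an assumption of this sketch).
    """
    x_num = np.asarray(x_num, dtype=float)
    x_den = np.asarray(x_den, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)     # weights 1 / pi_k
    num_hat = np.sum(d * x_num)               # estimated numerator total
    den_hat = np.sum(d * x_den)               # estimated denominator total
    r_hat = num_hat / den_hat
    z = d * (x_num - r_hat * x_den)           # weighted linearized values
    n = len(z)
    v_w = n / (n - 1) * np.sum((z - z.mean()) ** 2)
    return r_hat, v_w / den_hat ** 2

# Hypothetical sample of three units, each with inclusion probability 1/2.
r_hat, v_r = ratio_with_variance([1.0, 2.0, 4.0], [1.0, 1.0, 2.0], [0.5, 0.5, 0.5])
```

For a design-specific calculation the with-replacement step would be
replaced by the variance estimator appropriate to the design, for
example (2.4) or (2.5).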


3.3 Regression Coefficients and $R^2$


Suppose our data matrix X is partitioned into [Z, Y], the first column
of Z being the vector of 1's. The vector Y is $N \times 1$. We have
parameters of interest $\theta$, $B$ and $R^2$ defined by:

$$U_1 = 1^T Y - \theta = 0, \tag{3.2a}$$
$$U_2 = Z^T Z B - Z^T Y = 0, \tag{3.2b}$$
$$U_3 = (Y^T Y - N \bar{Y}^2)(R^2 - 1) + Y^T Y - Y^T Z B = 0. \tag{3.2c}$$

Here, $B$ denotes the vector of regression coefficients, $R^2$ is the
coefficient of multiple determination and $\theta$ is the total of the Y's.

We first consider the case where N is known. We let
$SSY = Y^T Y - N \bar{Y}^2$. We also define $S_{ZZ}$ as the estimator for
$Z^T Z$, $S_{YY}$ the estimator for $Y^T Y$ and $S_{ZY}$ the estimator for
$Z^T Y$. We therefore have:

$$\hat{\theta} = \hat{Y}, \tag{3.3a}$$
$$\hat{B} = S_{ZZ}^{-1} S_{ZY}, \tag{3.3b}$$
$$\hat{R}^2 = 1 - \frac{S_{YY} - \hat{B}^T S_{ZY}}{S_{YY} - N^{-1} \hat{\theta}^2}, \tag{3.3c}$$

where $\hat{Y}$ is the estimated total of the Y's in the notation of
Section 3.1, and

$$J = \frac{\partial U(X; B, R^2, \theta)}{\partial (B, R^2, \theta)} = \begin{bmatrix} 0^T & 0 & -1 \\ Z^T Z & 0 & 0 \\ -Y^T Z & SSY & 2\bar{Y}(1 - R^2) \end{bmatrix},$$

where $\bar{Y} = \theta/N$.
Therefore,

$$J^{-1} = \begin{bmatrix} 0 & (Z^T Z)^{-1} & 0 \\ 2\bar{Y}(1 - R^2)/SSY & B^T/SSY & 1/SSY \\ -1 & 0^T & 0 \end{bmatrix}.$$

Now, letting $W_k(B) = (Z_{k1} e_k, \ldots, Z_{kp} e_k)^T$, where
$e_k = Y_k - Z_k^T B$, we obtain:

$$V[\hat{B}] = (Z^T Z)^{-1}\, V[\hat{W}(B)]\, (Z^T Z)^{-1}. \tag{3.4}$$

This is a direct consequence of (2.10). Note that the set of $W_k(B)$
vectors corresponds to $U_2$ in (3.2b). Fuller (1975) obtains the same
result for stratified or two-stage stratified sampling.


To estimate (3.4) we use:

$$\hat{V}[\hat{B}] = S_{ZZ}^{-1}\, \hat{V}[\hat{W}(\hat{B})]\, S_{ZZ}^{-1}.$$
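
The estimate of (3.4) can be sketched as follows, assuming NumPy. The
covariance of the estimated totals of the $Z_k e_k$ vectors is again
computed with a with-replacement approximation, which is an assumption of
this sketch and not a prescription of the paper; the function name
`weighted_regression_variance` and the data are hypothetical.

```python
import numpy as np

def weighted_regression_variance(z, y, pi):
    """Weighted regression coefficients with a linearized covariance matrix."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)      # weights 1 / pi_k
    s_zz = z.T @ (d[:, None] * z)              # estimate of Z'Z
    s_zy = z.T @ (d * y)                       # estimate of Z'Y
    b_hat = np.linalg.solve(s_zz, s_zy)
    e = y - z @ b_hat                          # residuals e_k
    t = (d * e)[:, None] * z                   # rows Z_k e_k / pi_k
    n = len(y)
    centred = t - t.mean(axis=0)
    # With-replacement approximation to V[W-hat(B-hat)] (assumed here).
    v_w = n / (n - 1) * (centred.T @ centred)
    s_inv = np.linalg.inv(s_zz)
    return b_hat, s_inv @ v_w @ s_inv

# Hypothetical sample: intercept plus one covariate, equal probabilities.
z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.5, 5.5, 7.0])
b_hat, v_b = weighted_regression_variance(z, y, [0.25, 0.25, 0.25, 0.25])
```

The variances and tests printed by a standard regression routine would
not in general agree with `v_b`, which is the point made in Section 4.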
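We can also estimate the variance of $\hat{R}^2$. If
$W_k(B, R)^T = [Y_k,\; Z_{k1} e_k, \ldots, Z_{kp} e_k,\; (1 - R^2) Y_k^2 - e_k Y_k]$
and $c^T = [-2\hat{\theta}(1 - \hat{R}^2)/N,\; \hat{B}^T,\; 1] / (S_{YY} - N^{-1} \hat{\theta}^2)$,
then

$$\hat{V}[\hat{R}^2] = c^T\, \hat{V}[\hat{W}(\hat{B}, \hat{R})]\, c. \tag{3.5}$$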


For the case where N is unknown (e.g. the primary sampling units are
geographic areas), we have the additional equation:

$$U_4 = N - 1^T 1 = 0. \tag{3.6}$$

Adding the appropriate row and column to J and inverting, we obtain
the following results for estimating $V[\hat{R}^2]$.

We let

$$W_k(B, R)^T = [Y_k,\; Z_{k1} e_k, \ldots, Z_{kp} e_k,\; (1 - R^2) Y_k^2 - e_k Y_k,\; 1]$$

and

$$c^T = [-2\hat{\theta}(1 - \hat{R}^2)/\hat{N},\; \hat{B}^T,\; 1,\; \hat{\theta}^2 (1 - \hat{R}^2)/\hat{N}^2] / (S_{YY} - \hat{N}^{-1} \hat{\theta}^2).$$

We then have that $\hat{V}[\hat{R}^2]$ is given by (3.5) for these new
values of $W_k(B, R)$ and $c$.


3.4 Logistic Regression 


As in the previous section, we assume the data matrix X can be parti-
tioned into [Z, Y], but now Y is a vector of 0's and 1's. In the tra-
ditional statistical framework, the logistic regression model for Y
conditional on Z asserts that $Y_1, \ldots, Y_N$ are independent with
$\Pr\{Y_k = 1\} = p_k(\beta)$, where:

$$p_k(\beta) = \frac{\exp(\beta^T Z_k)}{1 + \exp(\beta^T Z_k)}. \tag{3.7}$$

Letting $B$ be the maximum likelihood estimator for $\beta$, we have that
$B$ satisfies

$$Z^T P(B) - Z^T Y = 0, \tag{3.8}$$

where $P(B)^T = [p_1(B), \ldots, p_N(B)]$.

For a given finite population, we define $B$ as our parameter of interest.
We let $\hat{C}(B)$ be our estimate for $Z^T P(B)$ and $S_{ZY}$ our estimate
for $Z^T Y$. Therefore, $\hat{B}$ satisfies $\hat{C}(\hat{B}) = S_{ZY}$.
These equations must be solved iteratively in general. We also have that
the (i,j)th component of $J$ is
$\sum_{k=1}^{N} Z_{ki} Z_{kj}\, p_k(B)[1 - p_k(B)]$. We denote
the estimator of $J$ by $\hat{J}$.

To estimate the variance of $\hat{B}$, we let

$$\hat{W}_k(\hat{B})^T = (Z_{k1} \hat{e}_k, \ldots, Z_{kp} \hat{e}_k),$$

where $\hat{e}_k = p_k(\hat{B}) - Y_k$. The estimator for $V[\hat{B}]$ is
given by:

$$\hat{V}[\hat{B}] = \hat{J}^{-1}\, \hat{V}[\hat{W}(\hat{B})]\, \hat{J}^{-1}.$$
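
The iterative solution and the variance estimator can be sketched as
follows, assuming NumPy and Newton's method for the weighted equations.
The with-replacement approximation used for the covariance of the
estimated totals is an assumption of this sketch; the function name
`weighted_logistic` and the intercept-only data are hypothetical.

```python
import numpy as np

def weighted_logistic(z, y, pi, n_iter=25):
    """Solve C-hat(B) = S_ZY by Newton's method; return B-hat and its covariance."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)          # weights 1 / pi_k
    b = np.zeros(z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(z @ b)))
        score = z.T @ (d * (p - y))                # C-hat(b) - S_ZY
        j_hat = z.T @ ((d * p * (1.0 - p))[:, None] * z)
        b = b - np.linalg.solve(j_hat, score)      # Newton step
    p = 1.0 / (1.0 + np.exp(-(z @ b)))
    j_hat = z.T @ ((d * p * (1.0 - p))[:, None] * z)
    t = (d * (p - y))[:, None] * z                 # rows Z_k e_k / pi_k
    n = len(y)
    centred = t - t.mean(axis=0)
    # With-replacement approximation to V[W-hat(B-hat)] (assumed here).
    v_w = n / (n - 1) * (centred.T @ centred)
    j_inv = np.linalg.inv(j_hat)
    return b, j_inv @ v_w @ j_inv

# Hypothetical intercept-only example with equal inclusion probabilities:
# the solution is the logit of the weighted proportion of 1's.
z = np.ones((4, 1))
y = np.array([1.0, 1.0, 0.0, 1.0])
b_hat, v_b = weighted_logistic(z, y, [0.5, 0.5, 0.5, 0.5])
```

The same function accepts any full-rank Z; the intercept-only case is
used only because its solution, log(3), can be checked by hand.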


3.5 Loglinear Models for Categorical Data 


Suppose that each member of the population belongs to exactly one of
q distinct categories. Associated with category i we have an $r \times 1$
vector $a_i$ such that the proportion of individuals in the i-th category
is approximately

$$p_i(\beta) = \frac{\exp(a_i^T \beta)}{\sum_{j=1}^{q} \exp(a_j^T \beta)}.$$

We let $p(\beta)^T = [p_1(\beta), \ldots, p_q(\beta)]$ and
$N^T = (N_1, \ldots, N_q)$, where $N_i$ is the number of individuals in the
i-th category. Now, if the population were generated from a multinomial
distribution with probabilities $p(\beta)$, the maximum likelihood
estimator for $\beta$, given by $B$, satisfies:

$$U = A^T N - [A^T p(B)]\, 1^T N = 0,$$

where $A$ is a $q \times r$ matrix with i-th row being $a_i^T$. We consider
$B$ as our parameter of interest for any given finite population.

We let $\hat{N}$ be a consistent asymptotically normal estimator of $N$,
with variance-covariance matrix $V[\hat{N}]$ and estimated matrix
$\hat{V}[\hat{N}]$. Our estimator, $\hat{B}$, satisfies:

$$A^T \hat{N} - [A^T p(\hat{B})]\, 1^T \hat{N} = 0. \tag{3.9}$$

This estimator was suggested by Freeman and Koch (1976). It may be
less efficient than Imrey, Koch and Stokes (1981, 1982) functional
asymptotic regression methodology; however, we need not calculate all
the components of $\hat{V}[\hat{N}]$ to apply (3.9).


Let $D(B)$ be diag[$p(B)$] and $H(B) = D(B) - p(B)\, p(B)^T$. We have:

$$J = \frac{\partial U}{\partial B} = -(1^T N)\, A^T H(B)\, A.$$

Therefore the asymptotic variance matrix for $\hat{B}$ is given by:

$$V[\hat{B}] = (1^T N)^{-2}\, (A^T H(B) A)^{-1} A^T (I - p(B) 1^T)\, V[\hat{N}]\, (I - 1\, p(B)^T) A\, (A^T H(B) A)^{-1}. \tag{3.10}$$

This expression can sometimes be simplified as follows. If it can be
assumed that $N / N^T 1 = p(B)$, then for $\hat{\pi} = \hat{N} / \hat{N}^T 1$
we have:

$$V[\hat{\pi}] = (N^T 1)^{-2}\, (I - p(B) 1^T)\, V[\hat{N}]\, (I - 1\, p(B)^T),$$

so that

$$V[\hat{B}] = (A^T H(B) A)^{-1} A^T\, V[\hat{\pi}]\, A\, (A^T H(B) A)^{-1}. \tag{3.11}$$

We also have that the covariance matrix for $p(\hat{B})$, the estimated
cell probabilities, is given by:

$$V[p(\hat{B})] = H(B)\, A\, V[\hat{B}]\, A^T H(B).$$

The estimators of $V[\hat{B}]$ and $V[p(\hat{B})]$ are similar expressions,
where $N$ and $B$ are replaced by $\hat{N}$ and $\hat{B}$ respectively.
These assume that $\hat{V}[\hat{N}]$ is readily available. For some
problems where q is relatively large compared to r, it would be more
efficient to proceed as follows. Let

$Y_{ki}$ = 1 if k-th unit in i-th category
         = 0 otherwise,

for k=1, ..., N; i=1, ..., q; $Y_k^T = (Y_{k1}, \ldots, Y_{kq})$, and

$$W_k = A^T [I - p(B) 1^T]\, Y_k.$$

We then obtain:

$$\hat{V}[\hat{B}] = (\hat{N}^T 1)^{-2}\, (A^T H(\hat{B}) A)^{-1}\, \hat{V}(\hat{W})\, (A^T H(\hat{B}) A)^{-1}.$$

We remark that the methodology described in this section can be readi-
ly extended to product-multinomial type models, where we have a log-
linear model for $\{N_{ij}\}$, but the margins $\{\sum_j N_{ij}\}$ are known.
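
As a sketch of how (3.9) might be solved in practice, the following uses
Newton's method with the derivative matrix $-(1^T \hat{N}) A^T H(B) A$. It
assumes NumPy and the multinomial logit form of $p_i(\beta)$; the function
names `cell_probabilities` and `fit_loglinear`, the design matrix and the
estimated counts are all hypothetical.

```python
import numpy as np

def cell_probabilities(a, beta):
    """p_i(beta) = exp(a_i' beta) / sum_j exp(a_j' beta)."""
    eta = a @ beta
    e = np.exp(eta - eta.max())        # subtract the maximum for stability
    return e / e.sum()

def fit_loglinear(a, n_hat, n_iter=50):
    """Solve A' N-hat - [A' p(B)] 1' N-hat = 0 for B by Newton's method."""
    a = np.asarray(a, dtype=float)
    n_hat = np.asarray(n_hat, dtype=float)
    total = n_hat.sum()                # 1' N-hat
    b = np.zeros(a.shape[1])
    for _ in range(n_iter):
        p = cell_probabilities(a, b)
        u = a.T @ n_hat - total * (a.T @ p)     # estimating function (3.9)
        h = np.diag(p) - np.outer(p, p)         # H(B) = D(B) - p p'
        # dU/dB = -total * A' H A, so the Newton step adds the solve below.
        b = b + np.linalg.solve(total * (a.T @ h @ a), u)
    return b

# Hypothetical three-category table; the third category is the reference.
a = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
n_hat = np.array([20.0, 30.0, 50.0])
b_hat = fit_loglinear(a, n_hat)
p_hat = cell_probabilities(a, b_hat)
```

Because this illustrative model is saturated, the fitted probabilities
reproduce the estimated proportions 0.2, 0.3 and 0.5; a more
parsimonious choice of A would not.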
4. DISCUSSION


The techniques described in the paper have been described for some
specific models; see, for example, Fuller (1975) and Freeman and Koch
(1976). However, the general results are not explicitly described.
Many standard statistical packages may be used for the estimation of
the parameters of the models described, but the variances and tests of
hypotheses given in these packages will not be valid.

The results of this paper depend on the assumption of asymptotic nor-
mality of the estimators. Empirical studies on the validity of these
approximations are important.

An alternative methodology to estimating many of the parameters des-
cribed here is given by Imrey, Koch and Stokes (1981, 1982). Their
functional asymptotic regression methodology also falls within the
general framework described here, with respect to variance derivation
and estimation.


AN OVERVIEW OF CANADIAN HEALTH STATISTICS:
PAST, PRESENT AND FUTURE 1


Lorne Rowebottom 2


The author briefly reviews the factors determining the 
production of health statistics in Canada, with particular 
attention to the different sources of data and to the long- 
standing co-operation among the many agencies involved in 
the gathering of health-related information. 


Mr. Chairman, I want to express my real pleasure at being a member of
this panel because of the opportunity that it affords me to congratu-
late Dorothy Rice and her colleagues in the National Center for Health
Statistics on the occasion of the completion of 25 years of Health
Surveys. We in Statistics Canada have long been admirers of NCHS
and my congratulations to Dorothy are on behalf of my colleagues in
Statistics Canada, particularly those in our Health Division.


Consistent, I hope, with the charge of our Chairman, I have chosen to
paint with a very broad brush what seem to me to be trends and deter-
minants of our health which might find echoes in other countries and
therefore be of interest to this audience.


Two data streams comprise the historic and current sources of Canada's
Health Statistics. The first is health institutions - predominantly
hospitals, both general and mental. From them we derive statistics
about a wide range of their characteristics, as well as statistics about
their patients and their illnesses. Canadian hospital statistics are
amongst the most detailed and comprehensive in the world.


1 As presented at the American Statistical Association Annual Meeting in
Detroit, August 1981.

2 Lorne Rowebottom, Assistant Chief Statistician, Institutions and
Agriculture Statistics Branch, Statistics Canada.
The second stream comprises the records generated by registration of
births, marriages and deaths from which we derive the critical statistics
on causes of death.


A wide variety of statistics is produced from such rich data bases and
some important statistics are derived from other sources, for example,
those on cancer incidence, from cancer registers, and notifiable diseases.
For those who are interested I have a few copies of a Directory of Health
Division Information and also I would be glad to send a copy to anyone who
wrote to me at Statistics Canada.

The important themes relating to these statistics that I want to touch on
this morning are the following:


- First, they measure illness only when individuals seek health 
care from institutions. 

- Secondly, they illustrate the strengths and weaknesses of statistics 
derived from surveys and from administrative records. 

- Thirdly, they represent the availability of information which could
only result from a very high degree of co-operation, sustained over
a long period of time, between the central agency, federal and
provincial departments of health, the institution and hospital
associations, and vital statistics registers.


I will return to these three characteristics of the health statistics
system: what is measured and what is not, the implications of data sources
and the degree of co-operation between the players in the system.


Why have we produced what we have, rather than different products by
different means? Looking back over sixty years of health statistics, I
found this an interesting question. Assessing how priorities were deter-
mined is a judgemental process - just as is deciding on today's priori-
ties. So it is my judgement that in part we responded to changing needs
for statistics articulated by users and Royal Commissions, and in part
we anticipated changing user needs ourselves and used existing data
sources, both because they related to such needs and because they
represented opportunities. They were there to be utilized, like the vein
of quartz that a prospector seeks and finds, or stumbles across. In part,
we were driven by, and we exploited, the rapidly changing technology. In
part the environment of co-operation in which we worked determined what we
did. And finally in many parts the resources available to us in terms
of dollars, human skills, and data handling capabilities, permitted some
things and not others.

These few critical factors:

- articulated and perceived needs,
- data sources available,
- changing technology to process and to analyse data,
- co-operation between players in the system,
- budgets available,

have been the determinants of what we have done. But it will be apparent
to you that they are also the determinants of what we are and will be doing.

These forces shift and come together in a changing kaleidoscope so that
during one span of time one combination is dominant, to be replaced by
another combination.


In Canada all have operated in such ways to bring about significant changes
in our health statistics and it seems apparent that there will result even
more rapid change. Changing needs should, of course, drive the system and
they are in fact doing so, albeit in some respects in an erratic manner.
You will recall my stating that the Canadian measurements of morbidity are
largely limited to hospitalized illnesses. This has been widely recognized
as a quite unacceptable state of affairs and a few years ago this dissatis-
faction led to a federal decision to institute a continuing health status
survey of the Canadian population. A survey was carefully planned and
tested from both conceptual and methodological points of view. However,
only 10 months' data were collected before government-wide budget
reductions forced cancellation of the survey. The first results from the
data collected have just been published and the data base has shown signs
of being a rich research source with significant decision-making implica-
tions. Of course, it suffers from the severe limitations of relating to
only one point in time. It is too early to state how long it may be
before a decision to reinstitute some form of the Canada Health Survey is
made. However, I am optimistic that the capacity of such measurements of
health status - to throw light on the effects of our lifestyles on our
good health and illness, and lead to individual and collective decisions
which will affect them - will not be ignored for long.


Let me turn from the area of health-related household surveys where the
Canadian track record of responding to changing needs is poor, to one
where we have both anticipated and responded effectively to new demands.
I refer to epidemiological studies designed to illuminate the kinds of
health risks resulting from exposure to various demographic, social, occu-
pational and environmental influences. Thanks to the foresight and
persistence of members of our Vital Statistics Staff working with a few
other key persons both within and outside Statistics Canada, we have a
computer-searchable Mortality Data Base file which includes all deaths in
Canada, coded by cause of death, extending back over three decades. We
also have a generalized record linkage facility which is being used to
link specific exposed population groups to the mortality file. Linkages
are also possible to an as yet incomplete but significant ten-year cancer
incidence file.


A paper which includes a largely Canadian bibliography on this area will
be given by Martha Smith, Head of Occupational and Environmental Health
Research Unit, in Scotland before the end of this month. It will be
available on request. (Both Martha and John Silins, Chief of our Vital
Statistics and Disease Registries Section, are in the audience.)


As to other data available to shape the future of Canadian Health Statistics,
I will only take time to mention the existence of data bases which are
very large, potentially very rich, and largely unused for national
statistical purposes.

They comprise the administrative records of our national medicare system
which record annually in excess of 30 million incidents of primary
medical care extended by physicians. We have demonstrated some of the
statistical potential of these files and we are now shaping new proposals
to develop their use during the next several years. Budgets are expected
to be the limiting factor.


New needs should drive the system - new technology does. The influence of
computers on health statistics is all-pervasive and is operating to change
the availability and uses of health statistics in profound ways.


I want to comment on the use of data - in the form of statistical inform-
ation, which computers have made possible - by managers, medical personnel
and administrators in hospitals, local hospital districts, states, provin-
ces, universities and associations. At federal levels, computers have
changed the ways in which data are processed and statistics are used. But
in many locations throughout the health community, computers have meant
that data are now used for purposes of understanding, for research and for
decisions, whereas in the precomputer era they were used little or not at
all.


Allowing for some exaggeration - but probably not very much - it was not
that long ago when national statistical agencies had almost a monopoly on
large-scale data handling capability. What a contrast between then and now
when large, fast, sophisticated and easily used information processing
capacity is economically available to both large and small organizations.
The implications are far-reaching and I suspect not yet fully perceived,
but they include at least:


- The existence of many rather than few producers of statistics
(many of these will perceive themselves as operators of MIS but
statistics is - and will be - the game if not the name).

- These same organizations will also be much more intensive users
of statistics - particularly statistics about their own organizations
or jurisdictions.

- As a result there will be greater knowledge of one's own
environment.

- There will be greater independence on the part of such organi-
zations and less need - maybe much less perceived need - to rely
on others for statistics.

- This ability to utilize the information contained in the adminis-
trative records of one's own organization or jurisdiction will
almost certainly reduce the tolerance for completing statistical
questionnaires, with a resulting increase in the necessity to rely
on administrative records. This could result in less information
being available about the total environment because of the problems
of data comparability between organizations and jurisdictions.


I find it difficult to forecast the impact that these changes will have on
co-operation between the many players essential to development and mainten-
ance of a comprehensive and inevitably complex system of health statistics.
All I can say is that in Canada - notwithstanding substantial pressures
which test and strain the system - co-operation has not diminished. In
fact, the reverse is the case and on this score also I am an optimist. I
think that one determinant of such co-operation is for national statistical
agencies to recognize that their role must change in response to the kind
of changes I have described. It is apparent to me that priorities must
shift from statistical production to statistical co-ordination.


One final word about what I consider to be an overriding priority, namely,
doing statistical analysis of our data bases to determine the messages that
are in them, to determine their meaning and significance, and to relate
them to the issues and problems confronting us.


For too long, we, at least we in Statistics Canada, have published numbers -
myriads of numbers - and failed to translate them into significant indi-
cators. We have left it to others to find the gold in the ore we have mined.
I think that we and the health community have paid a high price for our
failures (there have been successes) to find the gold, and even shape it
into jewellery with which users would enlighten our world, not unlike the
way necklaces lend radiance to those who wear them.
MODELS FOR ESTIMATION OF SAMPLING ERRORS 1


P.D. Ghangurde 2


This paper presents results of an empirical study on fitting
log-linear models to data on estimates of characteristics and
their coefficients of variation (CV) from the Canadian Labour
Force Survey. The characteristics were classified into
groups on the basis of design effects and models were fitted
to data on estimates of characteristic totals and their CVs
over a twelve-month period. The models can be used in
situations where estimates of CV are needed for new charac-
teristics, and for providing more precise estimates of
reliability of estimates based on past data. The problem
of evaluation of fit of the models is considered.


1. INTRODUCTION 


This paper presents results of an evaluation study on models for esti-
mation of the coefficient of variation (CV) of estimates of characteristics
based on the Canadian Labour Force Survey (LFS). The LFS is a monthly
household survey with a stratified multi-stage area sample design with a
sample size of approximately 55,000 households.

Each month estimates of CV are calculated for a set of characteristics
using the Keyfitz method of variance estimation based on Taylor series
approximation [4], [5]. However, computation of appropriate variance
estimates for all estimates tabulated from a large scale survey such
as the LFS is not possible due to operational constraints of time and
costs. The model-based estimates of CV can be used to obtain preli-
minary estimates of reliability for new characteristics based on the
past data, and when estimates of CV for an extended period (e.g. one
year) are needed. The models can also be used for obtaining concise
estimates of reliability, e.g. alphabetic indicators for ranges of
CV.

1 Presented at the American Statistical Association Annual Meeting
in Detroit, August 1981.

2 P. D. Ghangurde, Census and Household Survey Methods Division,
Statistics Canada.


In Section 2 the linear and non-linear models used for estimation of
totals and proportions are explained. Sections 3 and 4 review con-
siderations made in forming groups, fitting models and evaluation
of goodness of fit.


2. THE MODELS 


The LFS is a monthly household survey in which the dwelling is the final
stage sampling unit. Each of the ten provinces in Canada is divided
into economic regions which consist of groups of counties with similar
economic structure. The economic regions are divided into geographic
strata and multi-stage area samples are drawn without replacement with
two stages in self-representing strata in the large urban centres and
three or four stages in the non-self-representing strata in rural areas.
The sample selection in the initial stages is with probability propor-
tional to population size and that in the last stage, in which dwellings
are selected from clusters, is systematic.

The design-based estimates within strata are obtained by weighting the
data by the inverse of the probabilities of selection. An adjustment of
the basic weight for non-response and ratio estimation within age-sex
groups, which are post-strata, is used to obtain final estimates. The
census-based population projections for age-sex groups within each
province are used as auxiliary variable totals for ratio estimation.
More details on the sample design and estimation are given in [5].
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The variance estimates of various characteristics at the province 

level are obtained by Taylor series approximation assuming that the 
primary sampling units (psus) within non-self-representing strata are 
selected independently. In self-representing strata the sampled clusters 
are divided into two groups, which are treated as pseudo-psus and are 
assumed to have been selected independently. The variance estimate for 
an estimated characteristic total at Canada level is the sum of corres- 
ponding provincial variance estimates [5]. The variance of an estimate 


X of a characteristic total X in a province can also be expressed as 


$$ v(\hat{X}) = F\,(W-1)\,X\left(1-\frac{X}{P}\right), \tag{1} $$

where P = population for the province,
      W = inverse sampling ratio,
      F = design effect for the characteristic, and
      n = sample size (persons).


Expression (1) for $v(\hat{X})$ relates the variance obtained for the
complex ratio estimate based on a stratified multi-stage sample design
to the variance of the estimate based on a simple random sample of the
same size drawn from the finite population of size P. The sampling
variance of an estimate of a total based on a simple random sample of size
n (= P/W) is the usual binomial variance with finite population correction.
The design effect F is the factor by which the variance is increased by
such features as the sampling procedure at each stage, the extent of
stratification and post-stratification, the size of units at various
stages and the clustering of counts of the characteristic in the province.
Stratification and post-stratification usually reduce the variance of an
estimate, whereas clustering increases it.
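As a numerical check of expression (1), read here as v = F (W - 1) X (1 - X/P), the following Python sketch compares the simple random sampling part, (W - 1) X (1 - X/P), with the binomial variance with finite population correction written out directly. The function names and the figures used (a province of 1,000,000 persons, an inverse sampling ratio of 100, a total of 50,000 and a design effect of 1.5) are illustrative assumptions, not LFS values.

```python
def srs_variance_of_total(x_total, population, inverse_sampling_ratio):
    """SRS part of expression (1): (W - 1) * X * (1 - X / P)."""
    return (inverse_sampling_ratio - 1.0) * x_total * (1.0 - x_total / population)


def binomial_variance_with_fpc(x_total, population, inverse_sampling_ratio):
    """Variance of P * p_hat under SRS without replacement, with n = P / W.

    Uses the large-population form of the finite population correction, 1 - n / P.
    """
    n = population / inverse_sampling_ratio
    p = x_total / population
    return population ** 2 * (1.0 - n / population) * p * (1.0 - p) / n


def design_variance_of_total(x_total, population, inverse_sampling_ratio, design_effect):
    """Expression (1): the SRS variance inflated by the design effect F."""
    return design_effect * srs_variance_of_total(x_total, population, inverse_sampling_ratio)


# Hypothetical figures, for illustration only.
P, W, X, F = 1_000_000, 100, 50_000, 1.5
srs = srs_variance_of_total(X, P, W)
direct = binomial_variance_with_fpc(X, P, W)
complex_design = design_variance_of_total(X, P, W, F)
print(srs, direct, complex_design)
```

The two simple random sampling figures agree up to rounding, and the variance under the complex design is simply F times as large.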


In general, design effects tend to be greater than one because of the
clustered sample design of the LFS. The labour force status categories such
as "employed" and "unemployed" by age-sex groups tend to have lower design
effects, because post-stratification by age-sex decreases their
variance. Those for labour force status by particular industry tend to
be large because such industries are located in specific areas. Design
effects are known to be related to measures of homogeneity and the average
size of clusters. Models expressing these relationships have been developed
for many surveys. In a study on components of variance in the LFS,
the design effects and measures of homogeneity have been analyzed
for a number of characteristics [2].


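A measure of the precision of an estimate that is expressed relative to
its level, and so does not depend on the scale of measurement, is the
coefficient of variation (CV). The CV of $\hat{X}$ is given by

$$ \mathrm{CV}(\hat{X}) = \sqrt{F\,(W-1)\left(\frac{1}{X}-\frac{1}{P}\right)}. \tag{2} $$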


Taking logarithms to base e on both sides of (2) gives an equation
relating CV, X and P:

$$ \log \mathrm{CV}(\hat{X}) = \tfrac{1}{2}\log[F\,(W-1)] - \tfrac{1}{2}\log X + \tfrac{1}{2}\log\left(1-\frac{X}{P}\right). \tag{3} $$

Because of the third term on the right, equation (3) is not linear
in log CV and log X, even if F(W-1) is assumed constant. However, for
small values of X the contribution of the third term is negligible. A
model based on (3) is given by

$$ \log \mathrm{CV}(\hat{X}) = A + B \log \hat{X} + \varepsilon, \tag{4} $$


where A and B are parameters of the model and $\varepsilon$ is the error term. The
estimate of parameter B will differ from $-\tfrac{1}{2}$ depending on the extent to
which $B \log X$ approximates $-\tfrac{1}{2}\log[X/(1 - X/P)]$ over the range of X. In an
evaluation of fits of (4) and of an alternative model (5) given by

$$ \log \mathrm{CV}(\hat{X}) = A + B \log\left[\frac{\hat{X}}{1-\hat{X}/P}\right] + \varepsilon, \tag{5} $$

the goodness of fit for the two models as shown by $R^2$, the ratio of the
regression sum of squares to the total sum of squares, was found to be
quite close. Model (4) is linear in log X and log CV and is simpler
than model (5).
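The size of the third term in (3) can be checked numerically. The Python sketch below splits log CV into the three terms ½ log[F(W-1)], -½ log X and ½ log(1 - X/P), and confirms that they add up to the logarithm of expression (2). The provincial population and the value taken for F(W-1) are hypothetical.

```python
import math


def log_cv_terms(x_total, population, k):
    """Return the three right-hand terms of (3); k stands for F * (W - 1)."""
    return (
        0.5 * math.log(k),
        -0.5 * math.log(x_total),
        0.5 * math.log(1.0 - x_total / population),
    )


def cv_from_expression_2(x_total, population, k):
    """CV from expression (2): sqrt(k * (1/X - 1/P))."""
    return math.sqrt(k * (1.0 / x_total - 1.0 / population))


P = 1_000_000      # hypothetical provincial population
K = 1.5 * 99       # hypothetical F * (W - 1)
for x in (1_000, 10_000, 100_000, 300_000):
    first, second, third = log_cv_terms(x, P, K)
    # Third term, then the full right-hand side of (3).
    print(x, round(third, 4), round(first + second + third, 4))
```

For totals that are a small fraction of P the third term is close to zero; it only becomes noticeable when X is a sizeable share of the population, which is what pulls the fitted B away from -½.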


A non-linear model corresponding to (4) is given by

$$ \mathrm{CV}(\hat{X}) = A'\,\hat{X}^{B'} + \varepsilon, \tag{6} $$

where A' and B' are parameters of the model and $\varepsilon$ is the error term. The
two models (4) and (6) were fitted to data on monthly estimates and their
CVs for 90 characteristics in each of 10 provinces and Canada.


3. GROUPING OF CHARACTERISTICS 


The monthly design effects of LFS estimates for January-December 1980 for
each of 90 characteristics (excluding total population) for each province
and Canada were averaged and plotted to decide the ranges for the two
groups. In each province, the first group consists of characteristics
with average design effects less than or equal to a boundary value D.


Table 1 shows the boundary values D for groups I and II in each province
and at Canada level, and the number of characteristics in each group. The
grouping of characteristics was done by arranging characteristics in
increasing order of average design effects. The boundary value D was
selected so that the assumption of equal design effects was satisfied as
far as possible in group I. The second group consists of all remaining
characteristics, for which the assumption of equal design effects is more
crude. Most characteristics pertaining to labour force status by age-sex
groups fall in group I. "Employed by industry" and "duration of
unemployment" mostly fall in group II. The average design effects differ
substantially between provinces and for Canada. More refined grouping of
characteristics on the basis of models for design effects is being
investigated.
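The grouping rule can be sketched in Python as follows, taking the rule from the footnote to Table 1: a characteristic goes to group I when its 12-month average design effect is less than or equal to D, and to group II otherwise. The characteristic names, design effects and boundary value below are invented for illustration.

```python
def split_into_groups(average_design_effects, boundary):
    """Split characteristics into group I (deff <= D) and group II (deff > D).

    Characteristics are listed in increasing order of average design effect.
    """
    ordered = sorted(average_design_effects.items(), key=lambda item: item[1])
    group_1 = [name for name, deff in ordered if deff <= boundary]
    group_2 = [name for name, deff in ordered if deff > boundary]
    return group_1, group_2


# Hypothetical 12-month average design effects for a few characteristics.
avg_deff = {
    "employed, men 25-44": 1.1,
    "unemployed, women 15-24": 1.4,
    "employed, married": 1.6,
    "employed in forestry": 3.2,
    "unemployed 27 weeks or more": 2.4,
}
group_1, group_2 = split_into_groups(avg_deff, boundary=2.0)
print(group_1)
print(group_2)
```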


It may be noted that about 80% of the characteristics in each province
and for Canada have been classified in group I. For obtaining a
conservative estimate of CV for a new characteristic, models based on
group II can be used. For a characteristic for which monthly estimates
of CV are routinely produced, the model for the group in which the
characteristic falls can be used to obtain an approximate estimate of CV
with greater precision than that based on monthly data.


In the following section the assumptions made in fitting the models (4)
and (6) are explained and model fits are evaluated.


4. EVALUATION OF MODELS


The log-linear model (4) is fitted by treating it as a simple linear
regression model in $y = \log \mathrm{CV}(\hat{X})$ and $x = \log \hat{X}$ and obtaining
estimates of parameters A and B in the linear regression framework.
The usual assumptions of independence of errors and constant variance
have been made. Under these assumptions, $R^2$ provides a measure of fit of
the model. The values of the estimated parameters and coefficients of
determination, $R^2$, for groups I and II in 10 provinces and Canada are
given in Table 2. The actual fitting of these models was done using a SAS
utility.
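A minimal Python sketch of this fit is given below; it uses NumPy rather than the SAS utility mentioned above, and the data are artificial, generated from expression (2) with an assumed F(W-1) = 148.5 and P = 1,000,000. The function name is hypothetical.

```python
import numpy as np


def fit_log_linear(estimates, cvs):
    """Fit log CV = A + B log X by ordinary least squares; return (A, B, R^2)."""
    log_x = np.log(np.asarray(estimates, dtype=float))
    log_cv = np.log(np.asarray(cvs, dtype=float))
    slope, intercept = np.polyfit(log_x, log_cv, 1)
    fitted = intercept + slope * log_x
    ss_error = float(np.sum((log_cv - fitted) ** 2))
    ss_total = float(np.sum((log_cv - log_cv.mean()) ** 2))
    return float(intercept), float(slope), 1.0 - ss_error / ss_total


# Artificial data: CVs (in percent) computed from expression (2),
# assuming F * (W - 1) = 148.5 and P = 1,000,000.
P = 1_000_000
x = np.array([1_000, 2_000, 5_000, 10_000, 50_000, 100_000, 300_000], dtype=float)
cv = 100.0 * np.sqrt(148.5 * (1.0 / x - 1.0 / P))
A, B, r2 = fit_log_linear(x, cv)
print(round(A, 4), round(B, 4), round(r2, 4))
```

With these artificial data the fitted slope is slightly more negative than -½, because the term ½ log(1 - X/P) in (3) decreases as X grows; the fit in log scale is nonetheless very close.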


All $R^2$ values are significant and quite high, indicating that the fits are
very good. The error plots do not show any pattern suggesting that the
assumption of constant variance is not satisfied. Under these assumptions
and normality of errors, $\mathrm{CV}(\hat{X})$ has a log-normal distribution with
constant CV for any value of X.


The non-linear model (6) was fitted by the Gauss-Newton method using a SAS
utility. The initial values of parameters A' and B' were taken to be
1.00 and -0.50 respectively. The number of iterations required to reach
convergence was at most 8 for each province and Canada, the convergence
criterion being that the relative difference between successive error sums
of squares is less than $10^{-8}$. Table 3 shows values of estimated parameters
and error sums of squares for Canada Group II. The errors are approximately
normally distributed, as shown by normal probability plots.

Since it is of interest to compare the fits of the non-linear model for
provinces, Canada and the two groups, it is necessary to have a criterion
of goodness of fit. In the non-linear model, the total sum of squares is
not equal to the total of regression and error sums of squares. A
criterion $R'^2$ can be defined as


$$ R'^2 = 1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}, $$

where the $\hat{Y}_i$ are estimated CVs based on the model, the $Y_i$ are observed CVs
and $\bar{Y}$ is their mean. The summation extends over N, the number of
characteristics in the group multiplied by 12, the number of months. In the
linear case $R^2 = R'^2$. However, in the non-linear case $R^2 \neq R'^2$, since
the total sum of squares is not equal to the regression sum of squares plus
the error sum of squares, the cross-product term not being zero.
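The criterion, read as one minus the ratio of the error sum of squares to the total sum of squares about the mean, can be computed as in the short Python sketch below. The observed and fitted values are made up to show the two extremes: a perfect fit and a deliberately poor one.

```python
def r_prime_squared(observed, fitted):
    """R'^2 = 1 - (error sum of squares) / (total sum of squares about the mean)."""
    n = len(observed)
    mean = sum(observed) / n
    ss_error = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_total = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_error / ss_total


observed = [20.0, 12.0, 8.0, 5.0]       # hypothetical observed CVs (percent)
perfect = list(observed)                # fitted values equal to the observations
poor = [5.0, 8.0, 12.0, 20.0]           # a deliberately bad set of fitted values
print(r_prime_squared(observed, perfect))
print(r_prime_squared(observed, poor))
```

The poor fit gives a negative value, which illustrates why the criterion has no lower bound of zero once the fitted values are not least-squares values from a linear model.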


The errors $(Y_i - \hat{Y}_i)$ will be small when the fit is good, giving a value
of $R'^2$ close to 1. The errors $(Y_i - \hat{Y}_i)$ will be large when the fit is
poor, giving a small value of $R'^2$. When all the points lie on the fitted
curve, i.e. $\hat{Y}_i = Y_i$ for all i, $R'^2 = 1$. However, in general no lower bound
to $R'^2$ seems to exist. The values of $R'^2$ shown in Table 4 tend to be
greater for group I than for group II, which has 13 to 21 characteristics
out of the total of 90.
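The Gauss-Newton fitting of model (6) described earlier in this section can be sketched in Python as follows. This is not the SAS utility used for the study: the step halving is added here as a safeguard and is an assumption, the function name is hypothetical, and the data are artificial (a power curve with a small deterministic perturbation). The starting values 1.00 and -0.50 and the relative tolerance of 1e-8 on successive error sums of squares follow the text.

```python
import numpy as np


def gauss_newton_power_fit(x, y, a0=1.0, b0=-0.5, tol=1e-8, max_iter=50):
    """Fit y ~ a * x**b by Gauss-Newton with step halving.

    Returns (a, b, history), where history holds (a, b, sse) for the
    starting values and for every accepted iteration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b = float(a0), float(b0)
    sse = float(np.sum((y - a * x ** b) ** 2))
    history = [(a, b, sse)]
    for _iteration in range(max_iter):
        fitted = a * x ** b
        resid = y - fitted
        # Jacobian of the model with respect to (a, b).
        jac = np.column_stack([x ** b, fitted * np.log(x)])
        delta, *_ = np.linalg.lstsq(jac, resid, rcond=None)
        step, improved = 1.0, False
        for _halving in range(30):
            a_new = a + step * delta[0]
            b_new = b + step * delta[1]
            sse_new = float(np.sum((y - a_new * x ** b_new) ** 2))
            if sse_new < sse:
                improved = True
                break
            step /= 2.0
        if not improved:
            break  # no step length reduces the error sum of squares
        rel_change = (sse - sse_new) / sse_new if sse_new > 0 else 0.0
        a, b, sse = a_new, b_new, sse_new
        history.append((a, b, sse))
        if rel_change < tol:
            break
    return a, b, history


# Artificial data: CV = 1.2 * X**(-0.45), perturbed by +/- 1 percent.
x = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0])
wiggle = 1.0 + 0.01 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
y = 1.2 * x ** -0.45 * wiggle
a_hat, b_hat, history = gauss_newton_power_fit(x, y)
for a_it, b_it, sse_it in history:
    print(a_it, b_it, sse_it)
```

Printing the history gives a trace of the same shape as Table 3: one row per iteration with the current A', B' and residual sum of squares.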


Although the log-linear model (4) was fitted to data on logarithms of
estimates and their CVs and its fit seems to be good, the fitted models
for provinces and Canada are used for estimation of the CV of estimates. In
order to compare the fit of the transformed model to the original data on
estimates and their CVs, these data and the transformed model corresponding
to (4) were plotted for the two groups in 10 provinces and Canada. From
these charts it can be concluded that the transformed model corresponding
to (4) fits the data on estimates and their CVs better than the non-linear
model (6), especially for small values of estimates. The plots of these
models for Canada group II are shown on Charts 1 and 2.


5. CONCLUDING REMARKS 


The characteristics considered are total persons with labour force status 
by age-sex, industry, marital status and total persons with various ranges 
of duration of unemployment. However, the models can also be used for 
proportions instead of totals. The models are not applicable to estimates 
for subprovincial areas such as urban centres or groups of economic 
regions, since design effects for these areas are more unstable and can be 
much higher due to the effect of ratio-adjustment based on projected
population at province level [1].


An assumption made in the use of the models for a new characteristic is that
its design effect is close to the average for the group. This requires
finer grouping of characteristics of various types, possibly on the basis of
models relating design effects to measures of homogeneity for these
characteristics. In fitting the models, it was assumed that the errors are
uncorrelated and that the independent variable is fixed. Since twelve monthly
estimates for each characteristic were used, there could be correlation
in the errors for estimates of a given characteristic. Extension of the
study to models with errors in the independent variable and correlated
errors is being considered.


A problem in the evaluation of fit of non-linear models, whether actually
fitted to data or transformed from linear models, is the lack of a criterion
for comparison of fits of different models. The criterion suggested in
section 4 may be appropriate for comparison of fits of a model to different
data sets, but may not work for different models.


TABLE 1: DESIGN EFFECT BOUNDARY VALUES AND NUMBERS OF CHARACTERISTICS
         IN GROUPS I AND II*

                     Boundary       Number of Characteristics
Province             Value (D)      Group I      Group II

Newfoundland         [illegible]      75           15
P.E.I.               [illegible]      73           17
Nova Scotia          1.9              74           16
New Brunswick        [illegible]      77           13
Quebec               [illegible]      73           17
Ontario              [illegible]      69           21
Manitoba             2.0              76           14
Saskatchewan         2.8              76           14
Alberta              [illegible]      71           19
British Columbia     2.3              73           17
Canada               [illegible]      77           13

* A characteristic belongs to Group I if its design effect (averaged
  over the 12-month period from January to December 1980) is less than
  or equal to the boundary value D. If the average design effect is
  greater than D, then the characteristic is in Group II.


TABLE 2: REGRESSION COEFFICIENTS AND $R^2$ FOR LOG-LINEAR MODEL


                            Regression Coefficient
Province         Group        A              B            $R^2$

Newfoundland     I        [illegible]    [illegible]     0.9534
                 II       [illegible]    -0.6101         [illegible]
P.E.I.           I        2.7962         -0.5617         0.9485
                 II       3.1796         -0.5885         0.8887
Nova Scotia      I        3.4612         -0.5837         0.9702
                 II       3.6412         -0.5257         [illegible]
New Brunswick    I        3.2782         -0.5545         0.9606
                 II       3.7544         -0.6017         0.9357
Quebec           I        4.3298         -0.5942         0.9686
                 II       4.3093         -0.5216         0.9127
Ontario          I        [illegible]    -0.6053         0.9736
                 II       4.1796         -0.5009         0.9633
Manitoba         I        3.5155         -0.5926         0.9619
                 II       3.8769         -0.5640         0.9166
Saskatchewan     I        3.3796         [illegible]     0.9544
                 II       3.5478         -0.4423         0.8994
Alberta          I        3.6960         -0.5968         0.9678
                 II       3.7526         -0.5090         0.9513
B.C.             I        3.9847         -0.5750         0.9621
                 II       3.9814         -0.4708         0.8410
Canada           I        4.3458         -0.5936         [illegible]
                 II       [illegible]    [illegible]     0.9699


TABLE 3: NON-LINEAR LEAST SQUARES, GAUSS-NEWTON METHOD: CANADA (GROUP II)

Iteration         A'               B'           Residual S.S.

    0          1.00000000     -0.50000000     3401.93232121
    1         15.22076853     -0.23647629      461.76322678
    2         26.47981387     -0.36743343      322.67707190
    3         51.94184546     -0.51147529      248.68405130
    4         57.29455529     -0.47434886       99.32440727
    5         58.32558100     -0.48419609       96.57832290
    6         58.28627964     -0.48409502       96.57810754
    7         58.28746710     -0.48409960       96.57810746

TABLE 4: $R'^2$ FOR GROUPS I AND II

                                       $R'^2$ = 1 - (Error S.S. / Total S.S.)
Province         Group        N*

Newfoundland     I           866           0.9362
                 II          190           0.8835
P.E.I.           I           827           0.8925
                 II          294           0.7285
Nova Scotia      I           872           0.9790
                 II          192           0.7813
New Brunswick    I           908           0.9990
                 II          156           0.8639
Quebec           I           859           0.9800
                 II          204           0.7804
Ontario          I           823           0.9632
                 II          252           0.9208
Manitoba         I           895           0.9691
                 II          168           0.8137
Saskatchewan     I           896           0.9436
                 II          168           0.8196
Alberta          I           845           0.9701
                 II          228           0.8852
B.C.             I           868           0.9319
                 II          204           0.7786
Canada           I           923           0.9665
                 II          156           0.9286

* N for group I can be less than 12 x (no. of characteristics) due to
  exclusion of characteristics with zero estimates.


SURVEY METHODOLOGY 1982, VOL. 8


THE ROLE OF THE QUESTIONNAIRE IN SURVEY DESIGN 


R. Platek and D. Royce ¹


The modern statistical survey is an effective method of meeting 
the ever-increasing demand for timely and accurate data. One 
important component of the statistical survey is the question- 
naire. This article discusses the role of the questionnaire in 
meeting the needs of users, the relationship of the questionnaire 
to the other components of survey design, and the effect of the 
questionnaire on the quality of survey data. The importance of 
viewing the questionnaire as an integral part of the total survey 
design is stressed. 


1. INTRODUCTION 


The escalating demand for appropriate and timely information of various kinds 
and from various sources calls for an organized approach to the entire process 
of data collection. The past forty years have seen the emergence of the
statistical survey as an important tool to meet this need.


One important component of the statistical survey is the questionnaire. In the 
sections which follow, we describe the role of the questionnaire in meeting 
information needs, the relationship of the questionnaire to the other compo- 
nents of survey design, and the effect of the questionnaire on the quality of 
survey data. Although the discussion is presented mainly in the context of 
the household survey conducted by personal interview, many of the comments are
relevant to questionnaires and surveys of all types.


2. INFORMATION NEEDS AND THE ROLE OF THE QUESTIONNAIRE


The simplest definition of a questionnaire is that of a group or sequence of
questions designed to elicit information upon a subject from a
respondent. Within the range of techniques in questioning, the questionnaire
may range from a list of undefined topics to a highly structured set of
questions with no options for response other than those listed.


¹ R. Platek and D. Royce, Census and Household Survey Methods Division,
  Statistics Canada.


The questionnaire plays a central role in a complex process (the interview) in 
which information is transferred from those who have it (the respondents) to 
those who need it (the users). The questionnaire is the means through which 
the information needs of the users are expressed in operational terms which 
can be presented to a respondent in such a way that he will supply the 
required information. For this transfer of information to be effective, the
questionnaire must meet the requirements of both users and respondents.


The expression of information needs, which a user may initially only vaguely 
understand, in terms suitable to the respondent is not something that can be 
accomplished in one step. Instead, the questionnaire design evolves and is
refined as part of the overall survey development process.


For example, the user may begin with a need for information on "the housing 
conditions of the poor". He develops this into survey objectives by asking
questions such as:

(a) What is the problem we are trying to solve?
(b) What specific items of information are needed?
(c) How will the information be used?
(d) How accurate and timely does the information have to be?


In answering these questions, his thinking becomes more quantitative, and he 
expresses his information needs in terms of specific survey concepts. The 
survey concepts describe both what is to be measured and the units for which 
measurements are required. He may describe "housing conditions" in terms of 
the number of rooms, the presence of plumbing and electricity, or the state of 
repair of the dwelling. He may define "the poor" in terms of income level or
in terms of assets and debts.


It is important to emphasize that specific question wording is not at issue in
the development of survey concepts. The first step for the user in expressing
his information needs is to decide what should be measured, not how it will be
measured. The user should choose the concepts based on their relevance to his
information needs. He should consider, for example, what concepts are most
appropriate for the uses to be made of the data and whether the concepts are
compatible with other sources of information.


Once information needs have been expressed in terms of specific survey con- 
cepts, the questionnaire becomes the instrument by which these concepts are 
measured. Through specific questions and accompanying instructions, the user 
specifies precisely how the survey concepts are to be measured in operational 
terms. Several questions may be required to measure complex concepts. In the 
Canadian Labour Force Survey, for example, as many as ten questions are needed
to measure the concept "unemployed".


The questionnaire often serves as the document for recording of measurements 
as well. This is mainly of benefit to the interviewer or respondent, since it 
is convenient to record the answers immediately following the question. In 
theory, however, there is no reason why the questions and answers cannot be on
two separate forms.


In the more structured types of surveys, the questionnaire is an important
method of standardizing and controlling the data collection process. In
statistical surveys, in contrast to other methods of investigation, the
researcher usually cannot do his own data collection but must rely on inter-
viewers hired for the job. Without specific question wordings and instruc-
tions to follow, interviewers would inevitably change the meaning or emphasis
of questions and quite possibly the responses. The questionnaire helps ensure
that the researcher measures what he wishes to measure with every respondent.
It is, in effect, a "program" for the interviewer and respondent to follow in
order to produce the desired result.


The questionnaire cannot be too rigid, however. It must be flexible enough to
adapt to respondents of different age/sex groups, languages and social back-
grounds. Different words or groups of words may be needed in order to convey
the desired meaning to all respondents. The questionnaire must also
anticipate all of the possible answers that could be given. This is especially
true in the initial, exploratory stages of research where an unstructured
collection of data may be the most appropriate approach.


It must be recognized that the questionnaire is a complex and often imprecise 
measuring instrument. The subjects of measurement are human beings, and the 
process of measurement is based on language. As well as being a measuring 
instrument, the questionnaire is also a form of communication involving the 
researcher, the interviewer and the respondent. It transmits a request for 
information to the respondent, and it transmits the respondent's answer back 
to the researcher in a form usable to him. Warren Weaver, in The Mathematical
Theory of Communication (1949), identifies three problems that must be
faced in the design of any communication system:

A. How accurately can the symbols of communication be transmitted?
   (The technical problem.)

B. How precisely do the transmitted symbols convey the desired meaning?
   (The semantic problem.)

C. How effectively does the received meaning affect conduct in the
   desired way? (The effectiveness problem.)


All three problems are directly relevant to the construction of question- 
naires, and all three problems are closely linked. Within the context of 
statistical surveys, the way in which the questionnaire solves these problems 
plays a major role in determining how well the information needs of the user
are met.


3. THE QUESTIONNAIRE AND THE COMPONENTS OF SURVEY DESIGN


The process of making the survey concepts operational in a specific document 
forces the researcher to consider not only question wording, sequencing and 
layout, but nearly every other aspect of the survey as well. The question-
naire design must take into account elements such as the type of population
being surveyed, the sample design and sample size, the subject matter of the
survey, the interviewing method, the data processing techniques to be used,
and the budget and time available.


Figure 1 illustrates the questionnaire's relationships to some of the other 
elements which make up the total survey design. These interrelationships form 
a complex network; changes to one component of the design often require 
changes in several other components as well. Virtually any component of survey 
design could be placed at the centre of this network, but for the purpose of
discussion we have chosen to focus attention on the questionnaire.


Elements such as the type of population, the sample design and the required 
level of accuracy are closely interrelated with questionnaire design. For 
example, the heterogeneous nature of many survey populations results in a need 
for cross-classified data. These needs affect the sample size, the type and 
degree of stratification, and the reliability of the information. This in
turn will affect the questionnaire through the types of questions asked and
the level of detail requested. This will further have an effect on the cost
and timeliness of the information, the amount of respondent burden, and so on.


The questionnaire design is closely linked to the method of data collection 
and the survey's subject matter. Each method of data collection, such as per- 
sonal interviewing, telephone interviewing and mail surveys, creates its own 
survey conditions which may be more or less appropriate to a given subject
matter. These conditions will in turn affect the questionnaire's style of 
questioning, content, format, length and so on. In personal interviews, for 
example, it is often possible for the interviewer to collect certain data, 
such as type of dwelling and sex of respondent, by direct observation rather 
than questions. In addition, the questionnaire can be designed for the use of 
flash cards or other visual aids by the interviewer. The element of face-to- 
face communication is also a powerful motivating factor for the respondent. A 
personal interview is often the only choice when a complex, long and demanding 
questionnaire is involved. In telephone interviews, much of the social
interaction between interviewer and respondent is lost and the respondent's
co-operation may be affected. The questionnaire must rely entirely on verbal
communication for its success, and the subject matter may have to be less
demanding. However, with certain sensitive surveys (e.g. criminal
victimization surveys), the extra distance between interviewer and respondent
may actually make it easier to answer questions. In mail surveys, the
questionnaire itself assumes the role of interviewer. It must introduce the
survey, motivate the respondent to co-operate and guide the respondent in
completing the interview. It is a particularly demanding role which must be
taken into account in designing the questionnaire.


Figure 1: Elements Affecting the Questionnaire


Whether the survey is one-time or continuing also has an effect on 
questionnaire design. With a continuing survey, there is often more scope for 
learning from experience and refining the questionnaire over time. Experiments 
in question wording, programs to monitor response errors, and other methods of 
evaluating and improving the questionnaire design may only be feasible with a 
continuing survey. However, the ability to improve a questionnaire must be 
balanced against the disadvantages of change: for example the inability to 
make comparisons over time, the necessity to retrain interviewers, and the
necessity to change expensive computer software.


In many continuing surveys, such as the Canadian Labour Force Survey, the same
respondents are interviewed several times. The questionnaire must take into
account the total response burden during the respondent's stay in the survey.
The questionnaire may also have to adapt to different collection methods: for
example in the LFS the first interview is conducted in person while in urban
areas most subsequent interviews are conducted by telephone. Questionnaires
designed for continuing surveys must be developed with the longer term view in
mind.


The questionnaire is also interrelated with data processing and budgetary
concerns. The format of questions, for example open or closed, has direct
implications for operations such as coding, data capture, editing and
tabulations. The presence of many open-ended questions increases the time and
effort during coding operations, and the programs to edit and tabulate the
data become more difficult and costly to write and test.


The questionnaire as an operational expression of user needs thus involves the 
total survey design itself. Survey design is a combination of intricate com- 
ponents, among which the questionnaire plays a central role. The questionnaire 
neither determines the form of the other components, nor is its form deter-
mined by the others. The process of questionnaire design must flow from and
be a part of the total survey design process.


4. THE QUESTIONNAIRE AND ERRORS


All survey-taking is subject to errors from various sources, and in recent 
years non-sampling error has received increasing attention as a major compo-
nent of the total survey error (see, for example, Anderson et al (1979),
Bailar (1976), Hansen, Hurwitz and Bershad (1961), Koch (1973), and Platek and
Singh (1980)). The control of non-sampling errors is an integral and vital
part of survey design, requiring specific programs for the diagnosis, measure-
ment and prevention of errors. Further, each program will have its own costs
and benefits which must be taken into account in the design of controls
(Platek and Singh (1980)).


The questionnaire is both an important source of non-sampling errors, and an 
important part of programs for their prevention and measurement. The scient- 
ific development of data collection has lagged behind that of sample design 
and estimation; improvements in sampling techniques often deal in fractions of 
a percent while experiments in question wording may reveal variations of 20 
percent or more (Payne (1951)). This section discusses the relationship of 
the questionnaire to a few of the more important sources of non-sampling 
errors and illustrates the role of the questionnaire in minimizing these
errors.


4.1 Non-response errors


Non-response is one important source of non-sampling error. If the character-
istics of interest differ between respondents and non-respondents, bias will
almost certainly be introduced into the results. Non-response is basically of
two types: the "no contact" type (e.g. no one home, temporarily absent, bad
weather, etc.) and the "refusal" type. The latter may be either a complete
non-response or only non-response to some questions. The questionnaire can do
little to eliminate the "no contact" type of non-response but it does play an
important role in preventing refusals.


To understand how the questionnaire does this, it is important to first under- 
stand why respondents do or do not respond. Many different psychological 
forces motivate people to respond to surveys, including an interest in the 
topic, a desire to be helpful, a belief in the importance of the survey, a 
feeling of duty, or even a belief in their own importance. Other forces 
influence people to refuse: for example difficulty in understanding questions, 
fear of strangers, the feeling of one's time being wasted, difficulty in 
recalling information, and embarrassing or personal questions. All of these 
forces will have an effect on the questionnaire design through the way in 
which survey topics are introduced, the question wording, the questionnaire's 
appearance and length, assurances of confidentiality, and so on. At the same 
time, these forces interact with the survey's subject matter, the type of pop-
ulation and the data collection method, which in turn influence the design of
the questionnaire.


One must also consider the ability of respondents to respond. Unrealistic 
demands on the respondent's knowledge or memory, the use of overly difficult 
and technical language, or excessive demands on the respondent's patience are 
all sources of non-response which have their roots in the questionnaire. It 
must be said, however, that the patience of respondents often amazes even 
hardened survey designers. Chinnappa and Wills (1978) describe an interesting
study of non-response to the physical measures component of the Canada Health
Survey, where respondents were asked to submit to blood pressure tests, skin-
fold measurements and exercise tests, and were even asked to donate blood
samples.


A more thorough discussion of the causes and treatments of non-response is
given in Platek (1980).


4.2 Response errors


Response errors are a second category of non-sampling errors to which the 
questionnaire is closely related. Response errors can occur anywhere during
the question-answer-recording process, and may be either systematic (response
bias) or random (response variance).


Questions on sensitive topics, such as amounts and sources of income, use of
alcohol and tobacco, illegal activities or mental illness are subject to large
response errors. It is often felt, for example, that the respondent may
distort the answer to avoid embarrassment or to appear to conform to societal
norms (Warwick and Lininger (1975)). Many questionnaire design techniques
have been devised to counter this "social desirability bias", including the
anonymous questionnaire, the use of projective questioning techniques,¹ or
randomized response techniques in which the respondent selects which of two
(or more) questions he answers by a random choice. However, in a recent
study which compared questionnaire responses to external criterion information
(e.g. official records or test results), Marquis et al (1981) found, rather
surprisingly, that for most items which they studied the response bias was
almost negligible, but that the response variance was quite large. This
conclusion, if supported by other studies, indicates that measuring and
reducing response variance may also be important in sensitive topic surveys.
This might involve techniques such as reinterviews, internal consistency
checks during the interview, and the collection of other information
correlated with the variables of interest. This kind of emphasis has direct
implications for questionnaire design.


Questions which depend on the respondent to remember events, such as the 
taking of a trip or the occurrence of a crime, are another source of response 
errors. Events may be forgotten, or events which occurred before the refer- 


ence period may be incorrectly included.  Bushery (1981), in an experiment 


1 An example of projective questioning might be the sequence: 
1. What do you think most people feel about smoking marijuana? 
2. How do you yourself feel about it? 
The first question asks for the respondent's view of the societal norm 
and the second asks for his own view. 

with the U.S. National Crime Survey, found that victimization rates with a 
3-month reference period were much higher than those reported under a 6-month 
reference period, which were in turn higher than the victimization rates 
reported with a 12-month reference period. The bias due to recall loss with 
the longer reference periods was a much more serious source of error than 
sampling variability. The choice of an appropriate reference period for ques- 
tions involving recall has been examined in a number of different subject 
matter areas (Sudman (1980), National Center for Health Statistics (1972)). 
Bounded recall, where respondents are interviewed at the end of the reference 
period, or the use of prominent dates (e.g. Christmas) and calendar aids to 
jog respondents' memories have been shown to be of some value in reducing under- 
reporting (Neter and Waksberg (1965), Ashraf (1975)). With some topics, 
however, the only possible way to collect the information is to make the ques- 
tionnaire into a form of diary, where the respondent records the event during, 
or shortly after, it happens. Questionnaires of this type are used for the 


Food Expenditure Survey and the Fuel Consumption Survey of Statistics Canada. 


Although questions demanding recall and sensitive topics are important sources 
of response errors, there are many other causes. For example, an important 
component of response error is that due to the interviewer, the so-called 
correlated response error. Each interviewer exerts, to some degree, a common 
influence on all of the respondents in his/her assignment through the way in 
which the questions are asked, the way in which the respondent's replies are 
interpreted and recorded, and so on. The contribution of this component of 
error to the total survey error is directly related to the size of the 
interviewer's assignment. In telephone surveys, which may have quite large 
assignments, the correlated component can be a much more serious error than in 
personal interviews (Groves and Kahn (1979)). In turn, the correlated 
response error is more serious in personal interviews than in mail surveys or 
other surveys of the "self enumeration" type. This consideration was a major 
reason why the Census of Population and Housing has adopted the 
drop-off-mailback as the standard technique since 1971. The choice of data 


collection method in turn has a direct influence on the questionnaire design. 
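
The link between assignment size and the correlated component can be illustrated with a short sketch. It assumes the commonly used approximation in which the variance of an estimate is inflated by the factor 1 + rho (m - 1), where rho is the intra-interviewer correlation and m is the average assignment size. Neither this formula nor the numerical values below come from the studies cited above; they are illustrative assumptions only.

```python
def interviewer_variance_inflation(rho_int: float, assignment_size: float) -> float:
    """Approximate variance inflation due to correlated response error.

    Assumes the textbook approximation 1 + rho_int * (m - 1), where rho_int is
    the intra-interviewer correlation and m is the average assignment size.
    """
    if assignment_size < 1:
        raise ValueError("assignment_size must be at least 1")
    return 1.0 + rho_int * (assignment_size - 1.0)


# Hypothetical values: the same correlation with two different assignment sizes.
personal_interview = interviewer_variance_inflation(rho_int=0.02, assignment_size=25)
telephone_survey = interviewer_variance_inflation(rho_int=0.02, assignment_size=100)
print(f"personal: {personal_interview:.2f}, telephone: {telephone_survey:.2f}")
```

Under these assumed values, even a small correlation produces a sizeable inflation once assignments become large, which is the point made above about telephone surveys.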

Numerous other examples of response errors could be given. They depend on 
what question is asked, how the interviewer asks it, the way in which the 
respondent interprets and answers the question, and the way in which the 
interviewer interprets and records the answer. The interview is a dynamic, 
interactive process of communication between interviewer and respondent. How 
it is handled determines whether or not the interview produces the desired 
information in an accurate and efficient fashion. In the heat of the inter- 
view, it is the questionnaire, through its content, question wording, instruc- 


tions and layout, which must play the major role in controlling the situation. 


4.3 Data processing errors 


Once the interview is completed, the questionnaire becomes primarily a data 
processing document. Errors can occur at all phases of processing including 
coding, data capture, editing, imputation, estimation and tabulation. The way 
in which the questionnaire was designed will have a significant impact on the 


number and type of errors at this stage of the survey. 


By including data capture codes right on the questionnaire, for example, data 
capture errors are usually reduced significantly. The data are captured 
directly from the questionnaire without first being transcribed onto another 
form. A step beyond this is the Computer Assisted Telephone Interview. The 
questionnaire is stored in a computer program, which controls the entire 
interview process. The questions appear one at a time on a video display 
terminal in front of the interviewer, who then asks the question and types the 
respondent's reply directly into the computer. The data can be edited immed- 
iately and errors corrected while the respondent is still on the telephone. 
The process also reduces the incidence of questions missed or of incorrect 


application of skip instructions. 


Editing and imputation errors are also closely related to the questionnaire 
design. Problems of missing or inconsistent data can often be traced back to 
faulty questionnaire design. The ability to reconstruct or impute for missing 
values often depends on what concomitant variables were included on the ques- 


tionnaire and what kind of fail-safe mechanisms were built in. For example, 
in a survey which requests information on several detailed components of 
income, cases where the information is not given or is incorrect can often be 


salvaged by including a question asking for total income. 


Non-response errors, response errors, and data processing errors are a few of 
the non-sampling errors which are closely linked to the questionnaire and to 
the other components of the overall survey design. The questionnaire is inev- 
itably a cause of non-sampling error, but it must also go as far as possible 
in preventing errors. The degree to which the questionnaire succeeds at this 
task depends largely on the survey designer's knowledge of the various sources 
of errors and on his skill in integrating the design of the questionnaire with 
that of the entire survey. Each new survey may present new problems and pit- 
falls and as such they must be anticipated and taken into account in develop- 


ing questionnaires. 


5. CONCLUSION 


The preceding sections have illustrated the questionnaire's role as both an 
expression of the user's information needs and as an important determinant of 
the quality of survey data. In both roles, the questionnaire is closely 
linked to all of the components of survey design. The total survey design, 
and in particular the questionnaire, must try to maximize both the relevance 
of the data to the user and the accuracy of the data. Successful question- 
naire design incorporates both; we must ask the right question, and we must 


ask it in the right way. 


It is important to underline that users' needs and the requirements of accur- 
acy often conflict. The process of questionnaire design involves tradeoffs. 
A user may have to ask a simpler question than he would like simply for the 
respondent to be capable of answering. On the other hand, the questionnaire 
designer should not avoid asking complex questions simply because the answers 


may contain errors. 




Questionnaire development is not simply a laboratory process. Although guide- 
lines exist and research is possible, the skill of questionnaire design is 
learned to a large extent by practical experience and by trial and error. It 
is learned through discussions with users, interviewers and respondents. 
Questionnaire design is undoubtedly an interactive process which cannot be 
carried out in isolation and independent of other factors in survey develop- 
ment. It interrelates with them and, in fact, it forms an integral part of 


the total survey design. 


EVALUATION OF SMALL AREA ESTIMATION TECHNIQUES 
FOR THE CANADIAN LABOUR FORCE SURVEY¹ 


J.D. Drew, M.P. Singh, G.H. Choudhry² 


Estimates from sample surveys are sometimes required for domains 
whose boundaries do not coincide with those of design strata. 
Taking the Canadian Labour Force Survey as an example of a survey 
utilizing a clustered sample design, some alternative small area 
estimation techniques available in the literature are evaluated 
empirically including synthetic, domain (simple and post- 
stratified) and composite estimators which are linear combina- 
tions of synthetic and post-stratified domain estimators. A 
sample dependent estimator which attaches weight to the post- 
stratified domain estimate depending on the amount of sample in 
the domain is proposed and its performance is also evaluated. 


1. INTRODUCTION 


With increasing emphasis on planning, administering and monitoring social and 
fiscal programs at local levels, there has been demand for more and good 
quality data at these levels from various municipal, provincial and federal 
government departments as well as from private institutions. The type of data 
required ranges from simple population counts to complex socio-economic 
variables such as employment, unemployment, income, housing, poverty 
indices, health conditions and facilities etc. However, until recently not 
much attention had been paid to the development of sound statistical estima- 
tion techniques for small area data, with the notable exception of statistical 
demographers who for some time have been investigating the particular problem 
of small area population estimates, and who have identified several competing 
methods based on the use of administrative data and other sources. 

A comprehensive review of existing small area (domain) estimation techniques 
along with their limitations is given by Purcell and Kish (1979). From the 
research done to date it is clear that there is not a unique best solution to 
the small area estimation problem. The choice of a particular method for 
small area estimation will depend on the data needs and on the richness and 
availability of data sources, which differ from country to country, and 
within countries from one subject matter to another. Therefore, the classifi- 
cation of the type of small areas (domains) and examination of the data 
sources available in a particular context, followed by thorough investigation 
of the alternative small area estimation techniques for given situations, 
seems to be the most appropriate approach to development of small area data. 
In this context, we shall use the following classification of domains 
suggested by Purcell and Kish (1979) and point out the type of domain to which 


developments in this article primarily refer. 


(a) Planned domains - for which separate samples have been planned, designed, 
and selected. In the Canadian context, such domains for example may be 


economic or planning regions within a province or the province itself. 


(b) Cross Classes - which cut across the sample design and the sample units 
(may also be referred to as characteristic domains); e.g., age/sex, 


occupation, industry. 


(c) Unplanned Domains - that have not been distinguished at the time of sample 
design and thus may cut across the design strata or the primary sampling 
units (PSU's) within the strata. Examples of these in the Canadian 
context include Federal Electoral Districts, and Census Divisions or sub- 


divisions, counties and manpower planning regions. 
It should be noted that both types (a) and (c) refer to areal domains. 


We consider this distinction of the domains into the above types important 
since the form of the estimator as well as its efficiency would depend upon 
the particular type of application. As pointed out by Purcell and Kish most 


of the developments in small area estimation techniques in the United States 
and elsewhere have concentrated on the domains of types (a) and (b). In 
Canada however, type (a) and (b) domains are not so problematic due to the 
type of design and the sizes of the national surveys, and the main emphasis 
has been on the data for the domains of type (c), with the possible exception 


of population counts using symptomatic data. 


Investigations into the application and evaluation of small area estimation 
techniques for variables other than population started with the publication of 
synthetic estimates from the National Center for Health Statistics (1968). 
Since then a series of investigations (Gonzalez (1973), Gonzalez and Waksberg 
(1973), Schaible, Brock and Schnack (1977), Gonzalez and Hoza (1978) and 
others) have been carried out using data from the Current Population Survey in 
the application and evaluation of a particular synthetic estimator. Using a 
synthetic estimator whose form is different, studies were carried out by 
Purcell and Linacre (1976) aimed at production of estimates for Census 
divisions in Australia and by Ghangurde and Singh (1976, 1977, 1978) in the 
evaluation of synthetic estimates in the context of Canadian Labour Force 


Survey (LFS). 


As remarked by Purcell and Kish (1979), the nature of the design in relation 
to the domains of interest has an important role to play in the choice of an 
estimator. The estimators considered in this article are geared to the 
Canadian LFS where the domains are unplanned domains (type c) and are of a size 
such that, had they been planned domains (type a), the reliability of regular 
unbiased survey estimates would be satisfactory without having to resort to 
small area estimation techniques. Also in the LFS, primary sampling units 
are small (populations from 2,000 - 5,000) relative to the sizes of the 
domains of interest. This differs from the situation in the United States 
where the sizes of primary sampling units for most of the large scale surveys 
are larger, comparable in size to the small areas for which the estimates are 


desired. 


In this article estimators are evaluated in the context of producing Census 
Division level estimates from the Labour Force Survey, using data from the 


1971 and 1976 Censuses of Population and Housing in an auxiliary fashion. In 
addition to synthetic estimators, we evaluate post-stratified domain 
estimators which were considered earlier by Singh and Tessier (1976), and 
composite estimators which are linear combinations of the synthetic and the 
post-stratified domain estimators, similar to those considered by Schaible 
(1979) and Schaible, Brock and Schnack (1977). Also we propose and evaluate a 
new estimator which we call a sample dependent estimator, which is of the same 
form as the composite estimator, except the weight given to the synthetic 
component is a decreasing function of the amount of sample falling into the 
domain up to a critical point after which the estimator relies totally on the 
post-stratified domain component. Efficiencies of the small area estimators 
relative to the direct (or simple domain) estimator for the characteristics 
employed and unemployed were obtained in an empirical (Monte Carlo) study in 
which the LFS design was simulated using census data. The situations where 
both the design and the auxiliary information are up-to-date and where both 
are out-of-date were considered. We have also evaluated the bias of synthetic 
estimators for the characteristics employed and unemployed for Federal 


Electoral Districts. 


2. DESCRIPTION OF ESTIMATION PROCEDURE 


Consider a finite population consisting of N units (e.g. households or 
dwellings in household surveys), divided into L design strata labelled 1, 2, 
..., L. The stratification has been carried out on the basis of 
geographic and/or certain socio-economic characteristics, and the sample 
allocation ensures certain precision for estimates from individual strata. 
The problem considered is that of estimating the total of an x-variate for all 
those units belonging to an unplanned areal domain (type c). We denote by 
'a' the set of units belonging to the small area or domain of interest; thus 
the parameter to be estimated is the total of the x-variable in the domain 
'a', which we denote by ${}_aX$. 


Let $a_h$ be the set of those units belonging to the domain which are in 
stratum h, then 

$$a = \bigcup_{h=1}^{L} a_h. \tag{2.1}$$

In practice the domain 'a' will have a non-null intersection with a certain 
number of design strata and if we denote by $\underline{h}$ the set of such strata, then we 
have 

$$a = \bigcup_{h \in \underline{h}} a_h. \tag{2.2}$$


The particular design under consideration follows a multi-stage clustered 


sample design which is self-weighting within each stratum, with weight $W_h$ 
for stratum h. 

For a particular given sample we can obtain the quantities 

$t_h$ = sample total of x-variate in stratum h, 
and 
${}_at_h$ = sample total of x-variate in $a_h$, 

for h = 1, 2, ..., L. Note that ${}_at_h = 0$ for $h \notin \underline{h}$, the set of strata 
intersecting 'a'. Then the direct (also referred to as design based or simple 
domain) estimator for the total of x-variate for those units in 'a', say 
${}_a\hat{X}$, is given by: 

$${}_a\hat{X} = \sum_{h \in \underline{h}} W_h \, {}_at_h. \tag{2.3}$$


It should be noted that the direct estimator (2.3) does not utilize any auxi- 
liary information - all it requires is the identification of those sampled 


units which belong to the domain. Due to the clustered nature of the design, 
the sample falling in the domain may on occasion be very small or non- 
existent, generally resulting in high variance for this estimator. 
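
As a minimal sketch of how the direct estimator (2.3) could be computed, the following assumes each sampled unit is recorded as a (stratum, in-domain flag, x-value) triple and that the stratum weights are held in a dictionary. The function name, the data layout and the numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def direct_estimate(weights, sample):
    """Direct (simple domain) estimate: sum over strata of W_h times the
    sample total of x among sampled units that fall in the domain.

    weights: dict mapping stratum label h -> weight W_h
    sample:  iterable of (stratum, in_domain, x) triples, one per sampled unit
    """
    domain_totals = defaultdict(float)  # sample total of x in a_h, per stratum
    for stratum, in_domain, x in sample:
        if in_domain:
            domain_totals[stratum] += x
    # Strata with no sampled unit in the domain contribute nothing.
    return sum(weights[h] * total for h, total in domain_totals.items())


# Hypothetical sample: x = 1 if the person is employed, 0 otherwise.
weights = {1: 50.0, 2: 80.0}
sample = [
    (1, True, 1), (1, True, 0), (1, False, 1),
    (2, True, 1), (2, False, 1), (2, False, 0),
]
estimate = direct_estimate(weights, sample)
print(estimate)
```

Only three of the six sampled units fall in the domain here, which illustrates why the estimate rests on very little data when the domain cuts across clusters.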


The other estimators in this section rely in different fashions on auxiliary 
information for a variable y, which is often taken as the count of persons by 
population sub-groups (defined on the basis of age/sex etc.) from a recent 


census. These estimators are: 


1) Post-stratified domain 
2) Synthetic 

3) Composite 

4) Sample Dependent 


Additionally estimators (2) - (4) rely to differing degrees on sample external 


to the domain. 


For each of the above estimators, the adjustments based on the auxiliary 
information can be made either by applying separate adjustments to each 
stratum intersecting the domain, or by applying an overall adjustment for all 
strata intersecting the domain. Thus the estimators will be further 
classified as separate or combined depending on the level at which the 
adjustment is made. These estimators are denoted by ${}_a\hat{X}_{uv}$, where u is the 
level of adjustment with values: 

u = s : separate 
  = c : combined 

and v is the type of estimator taking the following values: 

v = p : post-stratified domain 
  = s : synthetic 
  = c : composite 
  = d : sample dependent 

For example, ${}_a\hat{X}_{cs}$ denotes the combined synthetic estimator, etc. 


2.1 Post-Stratified Domain Estimator 

Define 

$Y_{hg}$ = total of the auxiliary y-variable for population sub-group g in 
stratum h, and 

${}_aY_{hg}$ = total of the auxiliary y-variable for population sub-group g in $a_h$. 


Further let ${}_a\hat{Y}_{hg}$ be an unbiased estimate of ${}_aY_{hg}$, which would 
be formed analogously to the direct estimate defined in (2.3), except the 
characteristic being estimated in this case would be the auxiliary y-variable 
whose value is known for the set of sampled units (s) at some stage of 
sampling (whereas (2.3) is defined on the x-variate for the sample of ultimate 
units). In practice, provided auxiliary y-variable information is available 
for them, sampling units at any stage down to the penultimate stage could be 
used. 


Then the separate post-stratified domain estimator (for which adjustments are 


applied at the stratum level) is: 


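$${}_a\hat{X}_{sp} = \sum_{g} \sum_{h \in \underline{h}} \left( W_h \, {}_at_{hg} \right) \frac{{}_aY_{hg}}{{}_a\hat{Y}_{hg}}, \tag{2.4}$$

where ${}_at_{hg}$ is the sample total of the x-variate for population sub- 
group g in the intersection of domain 'a' and stratum h, and the inner sum 
runs over the set $\underline{h}$ of strata intersecting the domain. 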


Similarly the combined post-stratified domain estimator (for which adjustments 


are applied at the domain level) is: 


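$${}_a\hat{X}_{cp} = \sum_{g} \left( \sum_{h \in \underline{h}} W_h \, {}_at_{hg} \right) \frac{\sum_{h \in \underline{h}} {}_aY_{hg}}{\sum_{h \in \underline{h}} {}_a\hat{Y}_{hg}}. \tag{2.5}$$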

The post-stratified domain estimator is unbiased except for the effect of 
ratio estimation bias, provided ${}_aY_{hg}$ is obtained at the same time as 
${}_a\hat{Y}_{hg}$ and using the same source such as census. 
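
The separate and combined forms of the post-stratified domain estimator can be sketched as follows. The code assumes the quantities are stored in dictionaries keyed by (stratum, sub-group) cells; all names and numbers are hypothetical, and the sketch does not guard against an estimated auxiliary total of zero, which would make the ratio undefined.

```python
def post_stratified_separate(W, a_t, aY, aY_hat):
    """Separate form: each (stratum, sub-group) cell gets its own ratio
    adjustment aY / aY_hat (true over estimated auxiliary total in the domain)."""
    return sum(W[h] * a_t[h, g] * aY[h, g] / aY_hat[h, g] for (h, g) in a_t)


def post_stratified_combined(W, a_t, aY, aY_hat):
    """Combined form: one ratio adjustment per sub-group, pooled over strata."""
    total = 0.0
    for g in {g for (_, g) in a_t}:
        cells = [(h, gg) for (h, gg) in a_t if gg == g]
        x_hat = sum(W[h] * a_t[h, gg] for (h, gg) in cells)
        total += x_hat * sum(aY[c] for c in cells) / sum(aY_hat[c] for c in cells)
    return total


# Hypothetical inputs: two strata, two sub-groups ('m' and 'f').
W = {1: 50.0, 2: 80.0}
a_t = {(1, "m"): 2, (2, "m"): 1, (1, "f"): 1, (2, "f"): 2}             # domain sample totals of x
aY = {(1, "m"): 400, (2, "m"): 300, (1, "f"): 300, (2, "f"): 200}      # known auxiliary totals
aY_hat = {(1, "m"): 500, (2, "m"): 240, (1, "f"): 250, (2, "f"): 320}  # their sample estimates

separate = post_stratified_separate(W, a_t, aY, aY_hat)
combined = post_stratified_combined(W, a_t, aY, aY_hat)
print(separate, combined)
```

The two forms differ only in the level at which the ratio of true to estimated auxiliary totals is applied, so they coincide when the domain intersects a single stratum.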


Estimators of the above type have been considered earlier by Singh and Tessier 


(1976) with a different choice of post-strata. 
2.2 Synthetic Estimators 


We consider separate and combined synthetic estimators defined respectively as 


follows: 
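$${}_a\hat{X}_{ss} = \sum_{g} \sum_{h \in \underline{h}} \left( W_h \, t_{hg} \right) \frac{{}_aY_{hg}}{Y_{hg}}, \tag{2.6}$$

$${}_a\hat{X}_{cs} = \sum_{g} \left( \sum_{h \in \underline{h}} W_h \, t_{hg} \right) \frac{\sum_{h \in \underline{h}} {}_aY_{hg}}{\sum_{h \in \underline{h}} Y_{hg}}, \tag{2.7}$$

where $t_{hg}$ is the sample total for the x-variable for population sub-group 
g in stratum h. 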
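
A corresponding sketch for the synthetic estimators is given below, again assuming dictionaries keyed by (stratum, sub-group) cells and restricted to the strata that intersect the domain. The names and numbers are hypothetical.

```python
def synthetic_separate(W, t, aY, Y):
    """Separate form: the whole-stratum estimate W_h * t_hg is scaled by the
    domain's share aY_hg / Y_hg of the auxiliary total in that cell."""
    return sum(W[h] * t[h, g] * aY[h, g] / Y[h, g] for (h, g) in t)


def synthetic_combined(W, t, aY, Y):
    """Combined form: the share is computed once per sub-group over all strata."""
    total = 0.0
    for g in {g for (_, g) in t}:
        cells = [(h, gg) for (h, gg) in t if gg == g]
        x_hat = sum(W[h] * t[h, gg] for (h, gg) in cells)
        total += x_hat * sum(aY[c] for c in cells) / sum(Y[c] for c in cells)
    return total


# Hypothetical inputs: two strata intersecting the domain, one sub-group.
W = {1: 50.0, 2: 80.0}
t = {(1, "m"): 10, (2, "m"): 5}        # stratum sample totals of x
aY = {(1, "m"): 400, (2, "m"): 300}    # auxiliary totals within the domain
Y = {(1, "m"): 2000, (2, "m"): 1200}   # auxiliary totals for the whole stratum

syn_separate = synthetic_separate(W, t, aY, Y)
syn_combined = synthetic_combined(W, t, aY, Y)
print(syn_separate, syn_combined)
```

Note that no sampled unit needs to fall in the domain for these estimates to be computed, since only whole-stratum sample totals are used.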


The above synthetic estimator has been considered by Purcell and Linacre (1976) 
and also by Ghangurde and Singh (1976, 1977, 1978) who developed expressions 
for its variance and bias and evaluated the estimator using census data and a 
super-population model. A different form of synthetic estimator was proposed 
earlier by the National Center for Health Statistics (1968) and investigated 
by Gonzalez (1973), Gonzalez and Waksberg (1973) and Gonzalez and Hoza (1975, 
1978) using data from the Current Population Survey. 

The difference between the synthetic and post-stratified domain estimators can 
be readily seen by comparing (2.4) and (2.6). The post-stratified domain 
estimator uses only the sample falling into the domain (i.e. $W_h \, {}_at_{hg}$), 
and the adjustment factor is the ratio of the true to the estimated values for 
the y-variable for the domain and hence can take on values greater than or 
less than 1 (its expected value being unity). On the other hand the synthetic 
estimator uses the estimate from entire strata intersected by the domain 
(i.e. $W_h \, t_{hg}$ for $h \in \underline{h}$), which is then deflated by adjustment factors 
specific to population sub-groups (i.e. the ratio of the y-variable for the 
domain to the y-variable for the entire stratum). 


The synthetic estimator will suffer from bias depending on the degree of 
departure from the assumption of homogeneity for the x-variate between the 
domain and the larger area, namely the set of strata $\underline{h}$, within sub-groups 
of the y-variable. In defining the above synthetic estimator, the larger area 
was restricted to those strata which form part of the domain as it was 
believed that such a choice would lead to less bias. In general however, 
$\underline{h}$ need not be so restricted but it may include other neighbouring areas 
which are believed to satisfy the homogeneity assumption. Bias and mean square 
error of such estimators have been reported by some of the earlier referenced 
authors. 
2.3 Composite Estimators 


A composite estimator using the direct estimator and the synthetic estimator 
as the two components was suggested by Royall (1973) and others, and has been 
studied by Schaible (1978). Such an estimator minimises the chances of ex- 
treme situations (both in terms of bias and mean square error) and therefore 
may be preferred over either of its components. Synthetic estimators have a 
low variance by virtue of their use of data from a larger area to derive 
estimates for a small area (domain), but for the same reason this introduces bias 
which could be quite large if, as noted earlier, the assumption of homogeneity 
is not satisfied. On the other hand the simple domain estimator, which is un- 
biased, may have large variance particularly if the sample falling in the 
domain is very small. Empirical evidence of such relative performances of 
synthetic and direct estimators is available from Gonzalez and Waksberg (1973), 
Schaible, Brock and Schnack (1977), and Ghangurde and Singh (1977). The 
composite estimator considered here is obtained by replacing the direct 
estimator (2.3) by the post-stratified domain estimator which may be slightly 
biased but is generally more efficient than the direct estimator. 


The two types of composite estimators: namely, separate and combined are 
formed as linear combinations of the corresponding post-stratified domain and 


synthetic estimators; viz, 


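$${}_a\hat{X}_{sc} = \alpha_s \, {}_a\hat{X}_{sp} + (1 - \alpha_s) \, {}_a\hat{X}_{ss} \tag{2.8}$$

and 

$${}_a\hat{X}_{cc} = \alpha_c \, {}_a\hat{X}_{cp} + (1 - \alpha_c) \, {}_a\hat{X}_{cs}. \tag{2.9}$$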


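The optimum values $\alpha_s^*$ and $\alpha_c^*$ of the weights for minimum mse's are given by 

$$\alpha_s^* = \frac{\mathrm{mse}[{}_a\hat{X}_{ss}] - E[({}_a\hat{X}_{sp} - {}_aX)({}_a\hat{X}_{ss} - {}_aX)]}{\mathrm{mse}[{}_a\hat{X}_{ss}] + \mathrm{mse}[{}_a\hat{X}_{sp}] - 2E[({}_a\hat{X}_{sp} - {}_aX)({}_a\hat{X}_{ss} - {}_aX)]} \tag{2.10}$$

and a similar expression for $\alpha_c^*$. 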


Further, neglecting the covariance term in (2.10) under the assumption that 
this term will be small relative to $\mathrm{mse}[{}_a\hat{X}_{ss}]$ and $\mathrm{mse}[{}_a\hat{X}_{sp}]$, the 
optimal weight $\alpha_s^*$ can be approximated by 

$$\alpha_s^{**} = \frac{\mathrm{mse}[{}_a\hat{X}_{ss}]}{\mathrm{mse}[{}_a\hat{X}_{ss}] + \mathrm{mse}[{}_a\hat{X}_{sp}]} \tag{2.11}$$

with a similar expression for $\alpha_c^{**}$, which was the approach to defining weights 
followed by Schaible (1978). 
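
The weights discussed above can be sketched in a few lines. The first function is the general minimum-mse weight for a linear combination of two estimators, where the cross term is the expected product of their errors; the second drops that term. The mse values, the cross term and the two component estimates below are hypothetical numbers used only to show the arithmetic.

```python
def optimal_weight(mse_post, mse_synth, cross_term):
    """Weight on the post-stratified component that minimises the mse of
    alpha * post + (1 - alpha) * synth; cross_term = E[(post - X)(synth - X)]."""
    return (mse_synth - cross_term) / (mse_post + mse_synth - 2.0 * cross_term)


def approximate_weight(mse_post, mse_synth):
    """Same weight with the cross term neglected."""
    return mse_synth / (mse_post + mse_synth)


def composite(alpha, post_estimate, synth_estimate):
    """Linear combination of the two component estimates."""
    return alpha * post_estimate + (1.0 - alpha) * synth_estimate


# Hypothetical values.
alpha_opt = optimal_weight(mse_post=400.0, mse_synth=100.0, cross_term=20.0)
alpha_approx = approximate_weight(mse_post=400.0, mse_synth=100.0)
combined_estimate = composite(alpha_approx, post_estimate=180.0, synth_estimate=200.0)
print(alpha_opt, alpha_approx, combined_estimate)
```

In this example the component with the smaller mse (the synthetic one) receives the larger weight, as expected from the form of the expressions.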

2.4 Sample Dependent Estimators 


In practice the true values of $\alpha_s^*$ (or $\alpha_c^*$) used as the weight in the 
composite estimator will not be available as they involve population 
variances and covariances, which would have to be estimated from the sample. 
Further, calculation of the covariance term in (2.10), in particular, may be 
quite complex and thus one may have to resort to an approximate value ($\alpha_s^{**}$ or 
$\alpha_c^{**}$) which would require simply the estimated mse's of the two component 
estimators or an estimate of the ratio of the two mse's. In either case these 
estimates would introduce a certain amount of instability in the weight used, 
thus affecting the performance of the composite estimator. 


The sample dependent estimator (Drew and Choudhry (1979)), which is a 
particular case of a composite estimator, depends on the outcome of the given 
sample and is quite simple to compute. It is constructed using the result 
that the performance of the post-stratified domain estimator depends upon the 
proportion of the sample falling in the domain. If the proportion of the 
sample within the domain is 'reasonably large' then the sample dependent 
estimator is the same as the post-stratified domain estimator, otherwise it 
becomes a composite estimator with gradually increasing reliance (in the sense 
of increasing weight) on the synthetic estimator as the size of the sample in 
the domain decreases. Thus the separate sample dependent estimator (i.e. 
constructed at the stratum level) is given by 


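$${}_a\hat{X}_{sd} = \sum_{g} \sum_{h \in \underline{h}} \left[ \delta_{hg} \left( W_h \, {}_at_{hg} \right) \frac{{}_aY_{hg}}{{}_a\hat{Y}_{hg}} + (1 - \delta_{hg}) \left( W_h \, t_{hg} \right) \frac{{}_aY_{hg}}{Y_{hg}} \right], \tag{2.12}$$

where 

$$\delta_{hg} = \begin{cases} 1, & \text{if } {}_a\hat{Y}_{hg} / {}_aY_{hg} \ge K_0, \\ \dfrac{1}{K_0} \, \dfrac{{}_a\hat{Y}_{hg}}{{}_aY_{hg}}, & \text{otherwise.} \end{cases}$$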
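
Similarly the combined sample dependent estimator (i.e. constructed at the 
domain level) is given by 

$${}_a\hat{X}_{cd} = \sum_{g} \left[ \delta_g \left( \sum_{h \in \underline{h}} W_h \, {}_at_{hg} \right) \frac{\sum_{h \in \underline{h}} {}_aY_{hg}}{\sum_{h \in \underline{h}} {}_a\hat{Y}_{hg}} + (1 - \delta_g) \left( \sum_{h \in \underline{h}} W_h \, t_{hg} \right) \frac{\sum_{h \in \underline{h}} {}_aY_{hg}}{\sum_{h \in \underline{h}} Y_{hg}} \right], \tag{2.13}$$

where 

$$\delta_g = \begin{cases} 1, & \text{if } \sum_{h \in \underline{h}} {}_a\hat{Y}_{hg} \Big/ \sum_{h \in \underline{h}} {}_aY_{hg} \ge K_0, \\ \dfrac{1}{K_0} \, \dfrac{\sum_{h \in \underline{h}} {}_a\hat{Y}_{hg}}{\sum_{h \in \underline{h}} {}_aY_{hg}}, & \text{otherwise.} \end{cases}$$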


The ratios 

$${}_a\hat{Y}_{hg} / {}_aY_{hg} \quad \text{and} \quad \sum_{h \in \underline{h}} {}_a\hat{Y}_{hg} \Big/ \sum_{h \in \underline{h}} {}_aY_{hg}$$


indicate the over- or under-representation of the population sub-group at the 
individual stratum or domain level with respect to auxiliary information for 


the y-variable, conditional upon the selected sample. 


Values of ratios greater than or equal to 1 signify that, conditional on the 
given sample (s), the representation of the population sub-groups for the 
auxiliary y-variable is better than or as good as its unconditional 
representation had the domain been sampled independently at the same rate as 


the stratum. 


The value of $K_0$ may be appropriately chosen. In this study the efficiency 
of the sample dependent estimator has been investigated for two specific values 
of $K_0$, namely 1.0 and 0.5. 
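
A minimal sketch of the separate sample dependent estimator follows, assuming the weight on the post-stratified component equals 1 when the ratio of estimated to true auxiliary total reaches the critical value and is proportional to that ratio below it, as described above. The dictionary layout, the names and the numbers are hypothetical; a cell with no sample in the domain is handled by giving the synthetic component full weight.

```python
def sd_weight(y_hat, y_true, k0):
    """Weight on the post-stratified component for one cell."""
    ratio = y_hat / y_true
    return 1.0 if ratio >= k0 else ratio / k0


def sample_dependent_separate(W, a_t, t, aY, aY_hat, Y, k0=1.0):
    """Separate sample dependent estimate over (stratum, sub-group) cells."""
    total = 0.0
    for (h, g) in aY:
        delta = sd_weight(aY_hat[h, g], aY[h, g], k0)
        synth = W[h] * t[h, g] * aY[h, g] / Y[h, g]
        # With no sample in the domain, delta is 0 and only synth is used.
        post = W[h] * a_t[h, g] * aY[h, g] / aY_hat[h, g] if aY_hat[h, g] > 0 else 0.0
        total += delta * post + (1.0 - delta) * synth
    return total


# Hypothetical inputs: two strata, one sub-group.
W = {1: 50.0, 2: 80.0}
a_t = {(1, "m"): 2, (2, "m"): 1}        # domain sample totals of x
t = {(1, "m"): 10, (2, "m"): 6}         # stratum sample totals of x
aY = {(1, "m"): 400, (2, "m"): 300}     # true auxiliary totals in the domain
aY_hat = {(1, "m"): 500, (2, "m"): 240} # estimated auxiliary totals in the domain
Y = {(1, "m"): 2000, (2, "m"): 1200}    # auxiliary totals for the stratum

estimate_k1 = sample_dependent_separate(W, a_t, t, aY, aY_hat, Y, k0=1.0)
estimate_k05 = sample_dependent_separate(W, a_t, t, aY, aY_hat, Y, k0=0.5)
print(estimate_k1, estimate_k05)
```

With the critical value set to 1.0 the under-represented cell in stratum 2 borrows from the synthetic component, whereas with 0.5 both cells rely entirely on the post-stratified component.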




Holt, Smith and Tomberlin (1979) under the prediction approach derived an 
estimator (which relies on synthetic and direct estimates) where the weight 
attached to the direct component depends only on the sample falling into the 
domain. Sarndal (1981) proposed an alternative estimator in which the weight 
attached to the direct component depends on the sample in the domain relative 


to the sample in the larger area. 


3. DESCRIPTION OF THE EMPIRICAL STUDY 
3.1 Simulation of the LFS Design 


The LFS follows a multi-stage area sampling design (see Platek and Singh, 
(1976)). Within each of the 10 provinces of Canada, two principal area types 
are identified - the Self-Representing Units (SRU's) which correspond to 
cities generally of 15,000 or more population, and the Non Self-Representing 
Units (NSRU's) which correspond to smaller urban centers and rural areas. In 
the SRU's, cities are divided into compact areal strata with populations of 
15,000 each, within which a two stage sample of clusters (similar to blocks) 


and dwelling is selected. 


In NSRU's, Economic Regions, of which there are from 1-10 per province, form 
the starting point. These are stratified into 1-5 strata with populations 
from 30,000 to 80,000 using census data for 7 broad industry classifications. 
Within strata, primary sampling units (PSU's) from 2,000 - 5,000 in population 
are formed. The second stage in the rural portions of PSU's corresponds to 
1971 Census Enumeration Areas (i.e., EA's), with populations of roughly 500, 
whereas in urban portions all urban centers are selected with certainty. The 


last two stages correspond to clusters and dwellings. 


In simulating the LFS design two cases were examined: (i) the case where both 
the sample design and the auxiliary information are up-to-date, and (ii) the 


case where both are out-of-date. 




For (i), the sample design, the auxiliary information, and the study variables 
were all based on 1971 census data. Counts of persons (15+) cross-classified 
by age/sex, and Labour Force status were retrieved at the EA level. In 
NSRU's, for each replication in the Monte Carlo study independent samples of 
primaries and secondaries were selected based on census population or dwelling 
counts. Within rural EA's and urban centers, the final two stages of sampling 
were simulated by random samples of persons. In SRU's, EA's comprising the 
areal strata were known, but thereafter the LFS design was independent of the 
census. Hence for the purposes of the study, EA's were randomly partitioned 
into 'clusters' having a size distribution corresponding to that for LFS 
clusters. For each replication, a sample of 'clusters' and a random sample of 
persons within were selected. 
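
One way such a random partition could be simulated is sketched below. It assumes the target size distribution is supplied as a list of observed cluster sizes from which sizes are drawn at random; this mechanism, the function name and the sizes are illustrative assumptions, since the procedure actually used is not detailed here.

```python
import random


def partition_into_clusters(unit_ids, cluster_sizes, rng):
    """Randomly partition the units of one EA into pseudo-clusters whose sizes
    are drawn from an empirical list of (positive) cluster sizes."""
    ids = list(unit_ids)
    rng.shuffle(ids)
    clusters, start = [], 0
    while start < len(ids):
        size = rng.choice(cluster_sizes)
        clusters.append(ids[start:start + size])  # the last cluster may be smaller
        start += size
    return clusters


# Hypothetical EA of 500 persons and three assumed cluster sizes.
rng = random.Random(1)
clusters = partition_into_clusters(range(500), [20, 30, 40], rng)
print(len(clusters), sum(len(c) for c in clusters))
```

Every unit ends up in exactly one pseudo-cluster, so a sample of clusters followed by a sample of persons within them mimics the final two stages of selection.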
3.2 Choice of Population Sub-Groups 


The estimators defined in section 2 utilize auxiliary information for popula- 
tion sub-groups. Since the LFS is redesigned only decennially, it would be 
desirable to base the population sub-groups on information collected in the 
mid-decade as well as decennial census, so that the auxiliary information 
could be updated mid-way through the life of the survey. This ruled out such 
variables as industry or occupation, leaving various cross-classifications of 


basic demographic variables as the possible choices for population sub-groups. 


For the variables marital status, age and sex, the Automatic Interaction 
Detection (AID) procedure, due to Sonquist and Morgan (1964), was used on a 
sample of census data from across Canada to derive optimal population 
sub-groups, separately for each Labour Force characteristic. Results of the 
AID analysis showed that for unemployed, no population sub-groups accounted 
for more than 2% of the variation, while for the characteristics employed and 
not in Labour Force the following sub-groups accounted for approximately 25% 
of the variation: (i) age 15-16 and 65+; (ii) age 17-64, sex female; (iii) age 
17-64, sex male. Further splitting of these sub-groups did not result in 
significant additional gains. 
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In addition to estimators based on the above population sub-groups, estimators 
based on total population 15+, and on dwelling counts were also considered. 
Dwelling count data were included due to the possibilities which exist for 
up-to-date dwelling information being available intercensally at the required 
level of detail. It might be noted that the estimators using population 15+ 
and dwelling counts are both special cases of the general formulation where 


the number of population sub-groups equals 1. 
3.3 Evaluation of Efficiency of Small Area Estimators 


In the Monte Carlo study, we have considered 16 Census Divisions (CD's) and 11
Federal Electoral Districts (FED's) in the province of Nova Scotia and 7 FED's
from elsewhere in Canada. (There are altogether 18 CD's in the province of
Nova Scotia, but two of the 18 CD's correspond to complete LFS strata and
were therefore omitted from the study.) Due to the multi-stage nature of the
design and the large number of domains in the study, the computational costs
involved were high and it was decided to use only 100 replications.


Census Divisions and Federal Electoral Districts, it should be noted, comprise
networks of geo-statistical and geo-political areas respectively across
Canada. There are approximately 300 of each, with the populations of Federal
Electoral Districts being fairly uniform in the range 80,000 to 120,000, while
those of Census Divisions, which often correspond to local levels of
government or counties, vary greatly.


We have reported results only for the 16 Census Divisions in Nova Scotia. 
Results were similar for other unplanned domains considered. 
If we let ${}_aX_{m(r)}$ be the estimate of the total ${}_aX$ (i.e., the total for the
X-variable for the domain 'a') for the r'th replicate, for small area
estimation method m, then the average mean square error for the method m over
the 16 domains in the study was calculated as:

$$\text{Avg mse}(m) = \frac{1}{16} \sum_{a} \sum_{r=1}^{100} \left( {}_aX_{m(r)} - {}_aX \right)^2 / 100 \qquad (3.1)$$


The efficiency of the small area estimator (m) relative to the direct
estimator, say method $m_0$, was obtained as:

$$\text{Eff}(m) = \frac{\text{Avg mse}(m_0)}{\text{Avg mse}(m)} \qquad (3.2)$$
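
As an illustration of the two quantities above, the following minimal Python sketch computes the average mean square error over domains and replicates, and the efficiency of a method relative to the direct estimator. The function names and the toy numbers are assumptions made for illustration only; they are not taken from the study.

```python
from typing import Sequence


def avg_mse(estimates: Sequence[Sequence[float]],
            true_totals: Sequence[float]) -> float:
    """Average mean square error over domains.

    estimates[a][r] is the estimate for domain a in replicate r;
    true_totals[a] is the known total for domain a.
    """
    per_domain = []
    for est_a, truth in zip(estimates, true_totals):
        # mean over replicates of the squared error for this domain
        mse_a = sum((x - truth) ** 2 for x in est_a) / len(est_a)
        per_domain.append(mse_a)
    # mean over domains
    return sum(per_domain) / len(per_domain)


def relative_efficiency(estimates_m, estimates_direct, true_totals) -> float:
    """Avg mse of the direct estimator divided by avg mse of method m."""
    return avg_mse(estimates_direct, true_totals) / avg_mse(estimates_m, true_totals)


# Hypothetical toy data: 2 domains, 2 replicates each.
truth = [100.0, 200.0]
direct = [[90.0, 110.0], [180.0, 220.0]]
small_area = [[95.0, 105.0], [190.0, 210.0]]
print(avg_mse(direct, truth), relative_efficiency(small_area, direct, truth))
```

A value above 1 indicates that method m has a smaller average mean square error than the direct estimator.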


3.4 Evaluation of Bias of Synthetic Estimators 


Since the composition of the LFS frame and the Federal Electoral Districts 
were known for all of Canada in terms of both 1971 and 1976 census units, it 
was possible to compute exact biases of the synthetic estimators based on 
census data. The following cases were considered: (i) design and auxiliary 
information up-to-date (in which case the design, adjustment factors and 
x-variables were all based on the 1971 census); and (ii) design and auxiliary 
information out-of-date (in which case the design and adjustment factors were 


based on the 1971 census, but the x-variables were based on the 1976 census). 


Let ${}_aB_s$ and ${}_aB_c$ denote the biases of the separate and combined
synthetic estimates for unplanned domain 'a'; then we have

(equation (3.3) is not recoverable from the source)

where ${}_aY_{hg}$ and $Y_{hg}$ are defined as in section 2, and where $X_{hg}$ and
${}_aX_{hg}$ are similarly defined for the x-variable (based on the census).

Relative absolute biases at the province level were obtained by summing the
absolute biases over individual FED's and dividing by the provincial total for
the x-variable.
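
The province-level summary described above is a simple ratio. A minimal Python sketch follows; the function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
from typing import Iterable


def relative_absolute_bias(fed_biases: Iterable[float],
                           provincial_total: float) -> float:
    """Sum of absolute FED-level biases divided by the provincial total."""
    return sum(abs(b) for b in fed_biases) / provincial_total


# Hypothetical biases for three FED's in a province whose total is 1000.
print(relative_absolute_bias([50.0, -30.0, 20.0], 1000.0))
```

Because absolute values are summed, positive and negative biases in different FED's do not cancel in this measure.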
4. ANALYSIS OF RESULTS 
4.1 Efficiency considerations: Auxiliary Information up-to-date 


In this part of the empirical (Monte Carlo) study, data used for simulation of 
the design and the auxiliary variables used in estimation refer to the same 
period as those of the study variable; i.e., to the 1971 census. Efficiencies 
of the four small area estimators are presented relative to the direct 
estimator in Table 1, for separate and combined levels of construction, and 
for each of the following auxiliary variables - dwellings, total population 
(15+), and population by age/sex groups. Census Divisions in the province of 
Nova Scotia whose populations range from 3,885 to 39,260 were used as the 
unplanned domains (type c) for the purpose of the study. The following 


observations can be made: 


(i) Separate vs Combined Estimator: The level of construction of the estimator
does not have much impact on the efficiencies of synthetic estimators 
for both the characteristics employed and unemployed. For the 
post-stratified domain estimator for employed, however, the combined 
form is approximately twice as efficient as the separate. This is 
likely due to the effect of the clustering in the sample design being 


more accentuated with the separate estimator. 


Since the post-stratified domain estimator was less efficient in its
separate form, a similar result was anticipated for the composite
estimator and hence, only the combined composite estimator was
considered. On the other hand, the separate form of the sample
dependent estimator was found to rely slightly more on the synthetic
component, leaving the efficiencies unaffected by the level of
construction.

(ii) Effect of Auxiliary Information: The performance of population by
age/sex as an auxiliary variable is uniformly superior, although only
marginally so, to the total (15+) population for all four estimators
using auxiliary information. Further, both these variables out-perform
the dwelling count as an auxiliary variable.


In actual survey situations, the choice of population by age/sex as the
auxiliary variable may be desirable also from the point of view of
correcting estimates for biases due to non-response and undercoverage,
as both factors may be dependent on age and sex.


(iii) Comparison among the estimators: For unemployed, the performance
of the composite estimator with optimum $\alpha$ chosen for the characteristic
unemployed is marginally superior to the other estimators irrespective
of the level of construction, and the choice of auxiliary variable does
not seem to have appreciable impact on any of the estimators. For
employed, the situation is not that clear; however, the sample dependent
estimator shows an edge over the other estimators, particularly so with
population by age/sex as the auxiliary variable.
4.2 Efficiency Considerations: Auxiliary information out-of-date 


In this part of the study, whereas the design and auxiliary information were
based on 1971 census results, the study variable was based on the 1976
census. As can be seen from Table 2, although for unemployed the use of small
area estimation techniques showed larger gains relative to the direct
estimator (than in the up-to-date case), considerably smaller gains were
observed for employed, which would likely be due to the reduced correlation
between the study variable and the auxiliary information as both design and
auxiliary information become out-of-date. Also in this case, the efficiency
of the synthetic estimator is higher for both of the characteristics measured.
4.3 Consideration of Bias 


Given that the post-stratified domain estimator will generally have negligible
bias, the bias of both the composite and sample dependent estimators would
generally be smaller than that of the synthetic estimator, i.e. stemming only
from the degree of reliance on the synthetic component. Hence the bias of the
synthetic estimator was investigated in detail. Using the total population
(15+) as the auxiliary variable, the relative biases for the characteristics
employed and unemployed were computed and are given in Table 3 for the ten
provinces. These biases refer to the case where the unplanned domains are
Federal Electoral Districts and the study variables are based on 1976 census
data, while the survey design and adjustment factors (synthetic weights) are
based on the 1971 census. Biases were also computed using age/sex sub-groups
as the auxiliary variable and were found to follow similar trends while being
marginally smaller. It is observed from this table, with the exception of the
two smaller provinces, namely P.E.I. (for unemployed) and N.B. (for employed),
that the relative bias of the separate synthetic estimator is smaller than that of
the combined synthetic estimator for both the characteristics under study.
This confirms the intuitive feeling that the higher the level at which the
synthetic estimator is constructed, the higher would be the resultant bias in
general, due to weakening of the assumption of homogeneity.


Biases were also computed for the case when both the study variable and the
auxiliary information referred to the 1971 census. Biases for this case, while
slightly lower, followed similar trends to those in Table 3.


While the bias of the synthetic estimator was fairly small on average, it can 
be observed from Table 4 that it exceeded 10% in 13 and 19 (out of 279) FED's 
when the auxiliary information was up-to-date and out-of-date respectively. 
Further, in about half the instances for which the bias exceeded 10% for the 
up-to-date case the bias also exceeded 10% for the later time period when the 
auxiliary information was out-of-date. This suggests that for domains with a 
known high bias at the time to which the auxiliary information refers, less 
use should be made of the synthetic estimator. For instance, with the sample
dependent estimator the value of $K_0$ could be set lower in such cases.
However, there is still the danger of bias in the synthetic estimator from
category (ii) type cases in Table 4, which cannot be identified when deriving
current estimates during the intercensal period.




4.4 Efficiency vs Bias in Overall Choice of Estimator 


The synthetic estimator is generally highly biased and at the same time highly 
efficient. Therefore, in the search for a reasonable estimator for small 
areas, the question is to what extent one can reduce the effect of the 
synthetic estimator's bias, without sacrificing too much on its efficiency, in 
order to obtain a 'reasonable level of confidence’ in the final estimate. At 
the same time it is also important to determine the reliance on the synthetic 
estimator without introducing too many computational complexities. Looking 
from this perspective in the context of the Labour Force Survey, one should 
strive for small area estimators whose performance for unplanned domains is 
comparable to that of simple survey estimates for planned domains, and amongst 
estimators meeting this criterion, more emphasis should be on reducing bias 
than on improving efficiency, especially if the differences in efficiencies 


are minor. 


Average variances of the unbiased design estimator for the planned domains
(say X), comparable in size to the unplanned domains, were obtained analogously
to the average mse defined in (3.1). The efficiencies of the synthetic,
composite and sample dependent estimators relative to the usual survey
estimate for the planned domain, i.e. X, were also obtained. These efficiencies
ranged from 1.08 to 1.17 for unemployed, and 1.22 to 1.47 for employed; hence
all three estimators meet the above mentioned criterion. Since the sample
dependent estimator makes use of the synthetic estimator whenever there is
not 'sufficient' sample in the domain, its bias would depend upon the weight
attached to the synthetic estimator component, and this can be controlled by a
proper choice of $K_0$. Table 5 presents the $(1-\delta)$ values, averaged over 100
replicates, with $K_0 = 0.5$ and $K_0 = 1.0$ for the separate sample
dependent estimator using total population 15+ as the auxiliary variable for
each of the Census Divisions (unplanned domains) in this study. These average
$(1-\delta)$ values indicate the degree of reliance of the sample dependent estimator
on the synthetic component. As expected, domains consisting primarily of
partial strata tend to place increased reliance on the synthetic component.
Nevertheless, that reliance remains quite small. For example, with $K_0 = 1$,
the highest value it assumes is .28, for one of the Census Divisions.


Also as expected, the average $(1-\delta)$ values for $K_0 = 0.5$ are lower than
those for $K_0 = 1.0$, implying that the lower the value of $K_0$ chosen, the
lower would be the value of $(1-\delta)$ and consequently the less reliance (weight) on
the synthetic component of the sample dependent estimator. However, as
illustrated in Table 1, a trade-off between bias and efficiency is involved
since lower choices of $K_0$ also result in reduced efficiency. The above
values of $K_0$ provide a reasonable degree of confidence for the type of
domains discussed here. In general, however, other values of $K_0$ may be
chosen depending upon e.g. the size of the domain, sample size, strata sizes
and their geographical configurations with respect to the domain.
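
The relationship between the chosen constant and the reliance on the synthetic component can be illustrated with a small Python sketch. The definition of the weight belongs to section 2 and is not reproduced in this part of the paper, so the sketch assumes one plausible form: the weight on the post-stratified component is 1 when the estimated domain size reaches K0 times the known domain size, and is the ratio of the two otherwise. This form, the function names and the numbers are all assumptions made for illustration.

```python
def delta_weight(n_hat: float, n_known: float, k0: float) -> float:
    """Assumed weight on the post-stratified component.

    n_hat   : survey estimate of the domain size (assumption)
    n_known : known (census) domain size (assumption)
    k0      : constant controlling reliance on the synthetic component
    """
    threshold = k0 * n_known
    return 1.0 if n_hat >= threshold else n_hat / threshold


def sample_dependent(post: float, synth: float,
                     n_hat: float, n_known: float, k0: float) -> float:
    """Weighted combination of post-stratified and synthetic estimates."""
    d = delta_weight(n_hat, n_known, k0)
    return d * post + (1.0 - d) * synth


# Hypothetical domain whose sample under-represents it (800 vs 1000).
for k0 in (0.5, 1.0):
    d = delta_weight(800.0, 1000.0, k0)
    print(k0, 1.0 - d)  # reliance on the synthetic component
```

Under this assumed form, lowering K0 can only raise the weight on the post-stratified component, which agrees with the pattern described for Table 5.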


4.5 Concluding Remarks


1. The use of population by age/sex fares uniformly better than the
other auxiliary variables, although gains over total population (15+)
are marginal.


2. The post-stratified domain estimator although more efficient as com- 
pared to the simple domain estimator, performs poorly as compared to 


the other three small area estimators investigated. 


3. From the point of view of bias, the separate synthetic estimator has smaller
relative bias as compared to the combined synthetic estimator. Further,
while average biases tend to be fairly small and tend to increase
only slightly when the auxiliary information becomes out-of-date,
biases for individual domains can be very high and change
dramatically, frustrating efforts to identify 'outliers' where
reduced reliance on synthetic estimators should be made.


4. The combined composite estimator constructed as a linear combination
of post-stratified and synthetic estimators is more efficient than
either of its component estimators, although only marginally so as
compared to the synthetic component, for the optimum value of $\alpha$. Its
bias would depend upon the weight attached to the synthetic component
since the bias of the post-stratified estimator would generally be
negligible. Further, as the computation of the optimum $\alpha$ is quite
involved, in practice only an estimated value of $\alpha$ may be used,
resulting in a decrease in efficiency of this estimator.


5. The synthetic, composite and sample dependent estimators with $K_0 = 1$
are all more or less equally efficient, and out-perform the unbiased
design based estimator for planned domains.


6. Since the bias of the separate synthetic estimator is smaller than
that of the combined synthetic estimator, the separate sample
dependent estimator would result in smaller relative bias as compared
to the combined sample dependent estimator. The bias of the
separate post-stratified domain component can be controlled by
collapsing those strata for which the intersection with the domain is
very small. Thus considering all three aspects, namely bias, mean
square error and computational complexity, the sample dependent
estimator constructed at the stratum level using population by age
and sex would seem to be a better choice.
5. FUTURE DIRECTION OF INVESTIGATION: 


The study reported in this paper has focussed on evaluation of certain small 
area estimation methods using only census and survey data, in the context of 
the LFS, primarily for unplanned domains (type c). The estimators examined 
made use of synthetic and post-stratified domain estimators in different ways 
in an attempt to strike a balance between bias and mean square error. Below 
we point to directions which future investigations might take in efforts to
develop statistically sound techniques for small area data in the Canadian
context.


In the context of the Labour Force Survey, since the small area estimation
methods for the unplanned domains have out-performed the unbiased design based
estimates for comparable planned domains, it would be desirable to extend
this investigation to certain small planned domains (type a) as well. In
particular the sample dependent estimator considered here and other similar
estimators discussed in the literature will be further investigated for the Labour
Force characteristics. In addition these investigations should also be
extended to other smaller surveys conducted by Statistics Canada for which
small area data are in demand. Further work on development of methods of
variance estimation to be used in practice for these estimators is also
needed.


Other estimators which seem to be promising are the Structure Preserving 
Estimators (SPREE) suggested by Purcell and Kish (1980). In this approach the 
estimation process, specified by the association structure (i.e. the relation- 
ship between y and x variables at some previous time at domain level) and the 
allocation structure (i.e. the current relationship at the larger area level), 
preserves the earlier relationship present in the association structure 
without interfering with current information in the allocation structure. In 
the Canadian context, for characteristics for which large scale surveys (such 
as the Labour Force Survey) are undertaken regularly, it would seem the short 
term demand for data for domains of the size of FED's or Census Divisions may 
be met through the use of refined estimation techniques (and pooling of 
estimates over a period of time) utilizing census and survey data alone.
However, for meeting such demands in the longer term and for other types of 
data based on smaller surveys and other types and sizes of domains, all three 
sources of data namely census, surveys and administrative files would have to 
be fully explored. Multi-variate linear regression estimators of the type 
considered by Ericksen (1974) and Gonzalez and Hoza (1978) using data from all 
three sources should be studied in detail for their bias, mean square error 
and the computational complexities. Each of the three sources, with 
limitations of their own, when put together offer considerable potential for 
improvements in the sense that the weaknesses of one source can be the 
strengths of another. Hence there is reason for optimism that statistically 
sound techniques exploiting the strengths of data from different sources in an 
integrated fashion hold the future key to good quality small area data for a 


large variety of subject matters.


CHARACTERISTICS OF RESPONDENT AND NON-RESPONDENT
HOUSEHOLDS IN THE CANADIAN LABOUR FORCE SURVEY 


Elizabeth Clayton Paul and Murray Lawes¹


This article presents findings from a study to characterize 
responding and non-responding households in the LFS. This study 
was motivated by two projects associated with the LFS Redesign, 
namely, the family estimation project and evaluation of non- 
response compensation procedures. However, the results of the 
study are of general interest in the assessment of the quality of 
data emanating from the LFS. 


1. INTRODUCTION 


Non-response is the lack of complete information for all selected units in a 
sample or census. The occurrence of non-response poses special problems for 
the producers and users of survey data. Non-response affects the quality of 
survey data in two basic ways. First, it reduces the effective sample size, 
resulting in loss of precision of the survey estimates. Second, to the extent 
that differences in the characteristics of respondent and non-respondent units 
are not properly accounted for in the estimation strategies, it may introduce 
a bias into the survey estimates. This paper focuses on the latter aspect of 
quality, specifically the characterization of respondent and non-respondent 
units in the Canadian Labour Force Survey (LFS). This information will pro- 
vide some insight into the potential effect of non-response on the survey 
estimates and will suggest some variables which should be considered when 
compensating for non-response. Units were characterized by the variables size 
of household, economic family type, length of time in the survey, location, 
age of household members and labour force status of household members. This 
study is based on data derived from the LFS longitudinal data files. A
statement of major findings from this analysis is found in Section 2, followed by a brief
description of the LFS, of the longitudinal files and of the methodology used to
characterize non-respondent households in Section 3. Section 4 then presents
the derived data and resulting analysis. The final section briefly discusses
the impact of the findings of this study on the quality of LFS data at the
individual, family and household levels and suggests potential methods of
dealing with non-response to alleviate or minimize deficiencies in the survey
data arising due to non-response.


2. STATEMENT OF MAJOR FINDINGS 


Within the LFS, non-response compensation procedures are based on the assump- 
tion that the characteristics of non-respondent households are similar to the 
characteristics of respondent households. Should this assumption prove incor- 
rect, the non-response adjustment procedure will contribute to a bias in the 
survey estimates. It is impossible to determine the exact extent of this 
non-response bias. However, by examining longitudinal data on the survey life 
of a household, a profile of respondent and non-respondent households may be 


determined and the extent of differences evaluated. 


Of the many variables examined in the characterization of respondent and 
non-respondent households, the variables month in sample, household size and 
labour force status of household members exhibited a definite trend in rela- 
tion to response status. With respect to month in sample, the levels of non- 
response decreased as month in sample increased. Between months one and two 
the percentage of non-respondent households decreased sharply, and then grad- 
ually continued to decrease until month six, implying survey tenure is a crit- 
ical factor in the determination of survey response. Thus any estimates by 
rotation number based on a non-response adjustment across all rotation groups 


may impart a slight bias to estimates on a rotation number basis. 


Regarding household size, non-response decreased as household size increased.
On a distributional basis, households of size one were almost twice as common
among non-respondent households as among respondent households; conversely,
households of size 5 and over were more than twice as common among respondent
households. The implication is that a non-response
adjustment which does not take household size into consideration will, on
average, represent non-respondent households by households which contain more
household members than the non-respondent households.


The response patterns exhibited by household size and month in sample remained 
unchanged when the two characteristics were jointly examined. Since the anal- 
ysis of these two variables, household size and month in sample, has shown a 
strong functional relationship with non-response, a non-response adjustment 
incorporating household size and month in sample should do much to alleviate 
discrepancies by rotation number in sample survey estimates of household and 


economic family units, and of characteristics dependent on these variables. 


In addition to household size and month in sample, a relationship between 
non-response and labour force status was also exhibited, with particular ref- 
erence to unemployment. For non-respondent households, the percentage of in- 
dividuals classified as unemployed increased as month in sample increased, 
while the percentage for respondent households remained relatively stable. 
When the added dimension of household size was examined, a definite 
relationship was exhibited for households of size one with a slightly more 
variable pattern being exhibited for households of size two or more. For 
households of size one, the percentages of individuals classified as employed 
and unemployed were substantially greater for non-respondent households than 
for respondent. Also, the percentage of employed individuals decreased as 
month in sample increased; however, the percentage of unemployed increased. 
For households of size two or greater, the differences in the labour force 
distributions for respondent and non-respondent households were less 
pronounced than those for size one households, but the percentage of 
unemployed individuals in non-respondent households of size two or more did 


generally increase as month in sample increased. 


Although there may be advantages in utilizing some variables relating to
labour force activities in addition to household size and month in sample in
the non-response adjustment process, and thus improving the labour force
estimates, the desire for a general weight adjustment, the small sample size
at this level of aggregation, and the relatively low level of non-response
currently experienced in the LFS may preclude the implementation of a
non-response adjustment based on labour force status related variables. However,
a non-response adjustment on the basis of household size and month in survey
should have some benefits for the labour force estimates. Consequently, it
may be feasible to consider adjustments for two groups of households, namely
size one and size two or more, and for two survey tenures, namely one month
and two months or more, in evaluating any improvements to the current LFS
non-response adjustment process.
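
A weighting-class adjustment over the four cells suggested above (size one or size two or more, crossed with one month or two months or more in the survey) can be sketched in Python as follows. This is a generic weighting-class sketch and not the LFS procedure; the record layout and names are hypothetical, and it assumes that household size is available for non-respondent households, for example from another month of the survey.

```python
from collections import defaultdict


def cell_of(household_size: int, month_in_sample: int) -> tuple:
    """Assign a household to one of the four adjustment cells."""
    size_group = "size 1" if household_size == 1 else "size 2+"
    tenure_group = "month 1" if month_in_sample == 1 else "month 2+"
    return (size_group, tenure_group)


def adjustment_factors(households: list) -> dict:
    """Per-cell factor: weight of all households / weight of respondents.

    Each household is a dict with hypothetical keys
    'size', 'month', 'weight' and 'responded'.
    """
    total = defaultdict(float)
    responding = defaultdict(float)
    for h in households:
        cell = cell_of(h["size"], h["month"])
        total[cell] += h["weight"]
        if h["responded"]:
            responding[cell] += h["weight"]
    # Cells with no respondents are left out; they would need collapsing.
    return {c: total[c] / responding[c] for c in total if responding[c] > 0}


sample = [
    {"size": 1, "month": 1, "weight": 10.0, "responded": True},
    {"size": 1, "month": 1, "weight": 10.0, "responded": False},
    {"size": 3, "month": 4, "weight": 10.0, "responded": True},
    {"size": 3, "month": 4, "weight": 10.0, "responded": True},
]
print(adjustment_factors(sample))
```

Respondent weights in each cell would then be multiplied by that cell's factor, so that respondents stand in only for non-respondents of similar size and survey tenure.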


3. DATA SOURCE
3.1 The Labour Force Survey 


The LFS is a multi-stage stratified random sample with stratification
occurring within the economic region level for each province. The final unit of
sample selection is the dwelling. Each selected dwelling remains in the
survey sample for six months. At the end of that time, these dwellings are
replaced by another group of dwellings in such a manner that every month
one-sixth of the sample is replaced or rotated. This implies that in any
given month, there are six panels of dwellings in the LFS with each panel at
various stages of aging. That is, one panel is in the survey for the first
occasion (i.e., the birth rotation group), one panel for the second
occasion,..., and one panel for the sixth occasion.


During one week each month, Survey Week¹, LFS interviewers contact selected
dwellings to obtain information on the composition, demographic variables and
labour market activities of household members who are part of the survey
universe². For various reasons, interviewers are unable to obtain information
from all selected dwellings. These dwellings where no interview is conducted
are classified as vacant dwellings or non-respondent households³, depending on
their occupancy status. For vacant dwellings, no response is obtainable or
expected; whereas, for non-respondent households, survey information is
missing. An adjustment⁴ for non-response to compensate for this missing
information is made at the data processing stage based on the assumption that
households which have been interviewed, i.e., respondent households, typify
households which should have been interviewed, i.e., non-respondent households.
Should this assumption be false, then a bias is introduced into the survey
estimates by this adjustment for non-response. This bias will increase as the
rate of non-response increases. It is therefore important that the
characteristics of non-respondent and respondent households be similar, and
much effort is expended (successfully) in minimizing non-response.


3.2 Longitudinal Data File 


Estimates based on monthly cross-sectional LFS data provide a static snapshot 
of the population and labour market for each month; however, by linking resp- 
ondent information over the survey lifetime, a dynamic view of labour market 
activities is observed. In any given month, dwellings in one of the six rota- 
tion panels complete their six-month tenure in the survey. For dwellings in 
this panel, it is possible to trace the household composition and response 
pattern over the previous five months. This tracing is done by means of the 
Longitudinal Data File. The Longitudinal Data File is formed by concatenating 


the information on a given household over its six months of survey life. 


In the LFS, dwellings and individuals are assigned unique identification 
codes. This affords a method of linking individual, household, and dwelling 
information over the six months a dwelling is in the survey, thus creating the 


Longitudinal Data File. 


Initially, longitudinal records containing the six monthly response status 
codes are created for each dwelling. If a dwelling is respondent for one or 
more months, then individual records containing information on the household 
members who were living in the household at the time it was respondent are 
also included on the longitudinal file. However, if no response is indicated 
over the six months, only basic dwelling information is available for the 
dwelling. Thus, every individual who was a household member at some time over
the six-month survey period is associated with a Longitudinal Data File
record. From this record, labour market activity and demographic information
can be obtained for the months the individual was a responding household
member. Based on this formulation of longitudinal data, examination of
responding and non-responding households can occur and the characteristics of each
response type evaluated.
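
The linkage described in this section can be pictured with a minimal Python sketch that gathers monthly records into one six-position response history per dwelling. The field names ('dwelling_id', 'month_in_sample', 'status') and status codes are hypothetical; the actual layout of the Longitudinal Data File is not given here.

```python
from collections import defaultdict


def build_longitudinal(monthly_records: list, n_months: int = 6) -> dict:
    """Link monthly records by dwelling identifier.

    Returns {dwelling_id: [status for month 1, ..., status for month 6]},
    with None for months in which no record was found.
    """
    files = defaultdict(lambda: [None] * n_months)
    for rec in monthly_records:
        # month_in_sample runs from 1 to n_months
        files[rec["dwelling_id"]][rec["month_in_sample"] - 1] = rec["status"]
    return dict(files)


records = [
    {"dwelling_id": "D1", "month_in_sample": 1, "status": "non-respondent"},
    {"dwelling_id": "D1", "month_in_sample": 2, "status": "respondent"},
    {"dwelling_id": "D2", "month_in_sample": 1, "status": "vacant"},
]
print(build_longitudinal(records))
```

Individual-level records for the responding months could be attached to each dwelling history in the same way, keyed on the individual identification codes.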
3.3 Methodology for Deriving Estimates 


In examining the characteristics of responding and non-responding households, 
the type of household response for each month was required. On a monthly 
basis, there are three types of dwelling responses: respondent, non- 
respondent, and vacant. Responding households are those where the LFS ques- 
tionnaire is completed for all or some eligible household members. Non- 
respondent households are occupied by individuals who should be included in 
the survey but, for some reason, choose not to participate or are unable to 
participate due to existing circumstances. Vacant dwellings, on the other 
hand, are not occupied, or are occupied by individuals not included in the 


survey universe. 


Thus, in determining the characteristics of responding and non-responding 


households, dwellings labelled as vacant were ignored. 


To obtain the characteristics of responding households, the characteristics of 
individual household members who responded in the survey were examined; 
however, to obtain the characteristics of non-responding households, an 
imputation strategy was implemented. The characteristics of a non-responding 
household should be identical to or closely approximated by characteristics of 


individuals in that household in a month of response. 


For those households who did respond at least once during the six months the 
household was in the survey, the months of response were the information 
donors for any months of non-response during the six months. In this manner, 
the characteristics of non-responding households were estimated. To impute 
for non-response by this method, it was imperative that a given household be 


respondent for at least one month; however, the household could have been
respondent for more than one month. If this latter situation occurred, the
month of response closest to the month of non-response provided the donor 
information. If two months of response were equally close to a month of 
non-response, the month prior to the month of non-response was chosen as the 


donor month. The following algorithm summarizes this technique. 


Month of            Ordering of months to
Non-Response        check for donor information


If there was no month of response available, then no imputation was performed


and this household was excluded from this study. 
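

The donor-selection rule described above (nearest month of response first, with the prior month preferred when two months are equally close) can be sketched as follows. This is a minimal illustration only: the function names and the status codes "R", "N" and "V" are assumptions made for the example and are not taken from the Longitudinal Data File.

```python
def donor_order(month, n_months=6):
    """Months to check for donor data: nearest first, earlier month wins a tie."""
    others = [m for m in range(1, n_months + 1) if m != month]
    # Sort by distance from the non-response month, then by month number,
    # so that the prior month is checked before the following month.
    return sorted(others, key=lambda m: (abs(m - month), m))


def pick_donor(month, status):
    """Return the donor month for a month of non-response, or None.

    `status` maps month number -> "R" (respondent), "N" (non-respondent)
    or "V" (vacant). These codes are hypothetical.
    """
    for candidate in donor_order(month, len(status)):
        if status[candidate] == "R":
            return candidate
    return None  # no month of response: household is excluded from the study


for month in range(1, 7):
    print(month, donor_order(month))

history = {1: "N", 2: "R", 3: "N", 4: "R", 5: "N", 6: "N"}
print(pick_donor(3, history))  # months 2 and 4 are equally close; month 2 is used
```

A household whose six months are all non-respondent or vacant yields `None`, which corresponds to the exclusion described in the text.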


3.4 Cautionary Note 


If non-response rates based on this study are compared to non-response rates 
by rotation groups from the monthly LFS, they will differ in magnitude. The 
main source of difference is the exclusion of certain non-respondent house- 
holds from this study of longitudinal data. As previously indicated, the 
ability to characterize a household in a month of non-response depended on the 
availability of respondent data in an alternative month for that household. 
That is, there had to be at least one month of response for a non-respondent 
household to be characterized. This implies that a household which was non- 
respondent, or a combination of non-respondent and vacant, for each of the six 
months it was part of the survey sample was excluded from this study. Thus, 
some non-respondent households which contributed to the monthly LFS measurement
of non-response did not contribute to this longitudinal study of non-response.
Approximately 1.4% of the total sampled households were excluded on this basis.


Exclusion of some non-respondent households is the main reason for differences 
in data from this study and any other study on non-response which is based on 
the monthly LFS data. In addition to this source of discrepancy, the weighting 
technique applied may cause estimates to vary. For this report records were 
weighted by a product of the inverse sampling ratio, the sub-sampled cluster 
weight, and the stabilization weight. In examining and interpreting the
results in Section 4, or comparing these results to any other study on non-
response, it is necessary to remember that the data source was the
Longitudinal Data File, only records with at least one month of response 
contribute to the estimates, and the weighting structure was based on sample 


design weights only. 
4. ANALYSIS 


The methodology in the previous section documented the procedures used to 
derive estimates of characteristic totals from the longitudinal file. In this 
section a number of variables (separately and jointly) are examined with 
respect to their characterizations between respondents and non-respondents. A 
particular variable or cross-classification of variables is dealt with in each 
of the following subsections. The motivation for examining the variables, 
tables containing relevant tabulations and a summary of the essential results 


are presented for the various subsections. 
4.1 Month in Sample 


As noted in the introduction the LFS is based on a rotating panel design with 
each panel of dwellings remaining in the sample for a period of six months.
At the sample design stage, considerable effort is taken to ensure that the
sample associated with each rotation number (i.e. dwellings by panel) is a
representative one-sixth subsample of the full LFS sample. In the past a
number of references have been made to the phenomenon of rotation group bias,
i.e. that the expected value of estimates based on a single panel differs
depending on number of months in the sample. For this reason the composition
of the sample by month in sample and by response status was examined.
Weighted estimates of the number of households at the Canada level by month in 
sample and by response status were obtained based on averages over 1980 and 
1981 and are presented in Table 1. Due to design efforts to ensure represen- 
tativeness of the sample by rotation number, it was expected that the total 
weighted counts would be equally distributed by month in sample. Examination 
of the data revealed that very close to one-sixth (or 16.67%) of the total 
households fall into each month in sample class. In all cases the differences 


in percentage distribution for a cell were within one-half of 1%. 


When distributions of households by month in sample were examined by response 
status, deviations from a uniform distribution were observed, particularly for 
non-respondent units. The non-response rates by month in sample exemplified 
this fact. As illustrated in Table 1, the rate of non-response decreased as 
the number of months in the survey increased. The largest decrease occurred 
between the first and second months in the sample when the rate in the second 
month was approximately one-half of the rate in the first month. Further 
reductions in the non-response rates were observed as the number of months in 
the sample increased. Decreases in the rates between the second and sixth 


months were 21.1% and 34.2% for 1980 and 1981 respectively. 
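

The rates discussed above can be illustrated with a short calculation. The sketch below assumes that the non-response rate is the weighted number of non-respondent households divided by the weighted number of respondent plus non-respondent households (vacant dwellings excluded), and that the quoted decreases are relative decreases. The weighted counts are hypothetical and are not taken from Table 1.

```python
def nonresponse_rate(respondent, nonrespondent):
    """Weighted non-respondent households as a share of occupied households."""
    return nonrespondent / (respondent + nonrespondent)


def relative_decrease(rate_from, rate_to):
    """Decrease from one rate to another, as a fraction of the first rate."""
    return (rate_from - rate_to) / rate_from


# Hypothetical weighted counts (thousands): month -> (respondent, non-respondent)
counts = {1: (1250.0, 100.0), 2: (1300.0, 50.0), 6: (1310.0, 40.0)}
rates = {month: nonresponse_rate(r, n) for month, (r, n) in counts.items()}

for month, rate in rates.items():
    print(month, round(100 * rate, 2))
print(round(100 * relative_decrease(rates[1], rates[2]), 1))
print(round(100 * relative_decrease(rates[2], rates[6]), 1))
```

With these illustrative counts the rate halves between the first and second months and falls by a further one-fifth between the second and sixth months.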


The percentage distribution of non-respondent households exhibited a similar 
decreasing trend as number of months in the sample increased. On a distribu-
tional basis, there were substantially more non-respondent households in the
first month in sample than there were respondent households; however, this 
number decreased with increasing tenure in the survey. Thus any estimates by 
rotation number based on a non-response adjustment across all rotation numbers 


may impart a slight bias to estimates on a rotation number basis. 
4.2 Household Size 
In the LFS, non-response generally occurs at the household level, i.e. the 


rate of partial non-response within households is very low. The household is 


the unit at which non-response occurs. Thus the characterization of house-
holds is necessary for the determination of the effects of non-response on
estimates from the survey - be they at the level of household, family, or 
individual units. Perhaps the most basic household attribute, in relation to 
deriving demographic/socio-economic estimates from the survey, is household 
size. From a data collection point of view it is reasonable to assume that 


difficulties of contacting households decrease with increasing household size. 


To evaluate the potential effect of household size on the non-response rate, 
Table 2 presents the percentage distribution of households by size and resp- 
onse status based on averages over the calendar years 1980 and 1981. For both 
years the non-response rate decreased dramatically as household size
increased. Non-response rates by household size ranged from a high of 7.48%
for households of size 1 to a low of 1.89% for households of size 5 or more in
1980 and correspondingly from 6.58% to 1.69% in 1981 for households of sizes
1 and 5 or more, respectively. An examination of the distribution of respond-
ing and non-responding households by size of household revealed a substantial
difference in the distribution of households by size depending on the response
status. On a distributional basis there were almost twice as many households
of size one for non-respondent households as for respondent households. For
respondent households there were slightly more than 50% which were of size 3
or more, whereas for non-respondent households only about 30% were of size 3
or more. The distributional differences in household size between respondent
and non-respondent households were also reflected in the average household size
for each response type. For 1980 the average household size for respondent 
and non-respondent households was 2.93 and 2.26, respectively; while for 1981 
the corresponding sizes were 2.88 and 2.19. The implication is that with the 
adjustment for non-response at the LFS data processing stage, non-respondent 
households are represented by households which, on average, contain more 
household members than the non-respondent household. This leads one to ques-
tion the assumption that respondent households typify non-respondent house-
holds, at least with respect to household size.


4.3 Household Size by Month in Sample


In the previous two subsections substantial variations in the response rates 
were noted depending on the number of months in sample and also depending on 
the size of household. The next table was obtained to determine whether the 
noted variations in non-response rates were also observed when either house- 
hold size or month in sample was held constant. Based on annual averages for 
1980 and 1981, Table 3 presents percentage distributions of respondent and 
non-respondent households by household size and month in sample as well as the 


corresponding non-response rates for 1980 and 1981, respectively. 


These tables show that the decreasing trends in non-response rates observed in 
Tables 1 and 2 for the full populations also hold true when the rates are 
examined holding one of the variables constant and letting the other vary. 
For example, in Table 1 non-response rates for all household sizes combined 
were shown to decrease as month in sample increased. Table 3 generally shows 
the same phenomenon when one examines the pattern of response rates by month 
in sample for each of the household size groupings separately. As when months 
in the survey alone were examined, the non-response rate decreased sharply 
from month one to month two. Similarly, the non-response rate decreased from 
month one to month two by approximately one half for each given household 
size. For households of size one and two the non-response rate continued to 
decrease in subsequent months in the survey; however, for households of size
3 and greater the non-response rate tended to stabilize during the second month


in the survey. 


Holding the number of months in the survey constant and examining the non-
response rate as the household size varied revealed a pattern similar to that
exhibited in Table 2, where household size alone was considered. The non- 
response rate decreased with increasing household size. Table 3 likewise 
shows that for a given number of months in the survey (from one to six), there 


is a decreasing trend in the non-response rate as household size increases. 


Combining these two trends, there was an expectation that the highest non-
response rate would be observed in households of size one during the first
month in the survey. Similarly, there was an expectation that the lowest non-
response rate would be observed in households with five or more members during
the final month in the survey (i.e., in month six). Based on annual averages 
for 1980 and 1981, this expectation was verified. In 1980 and 1981 the non- 
response rates of highest magnitude were 13.39% and 12.81% respectively. Each 
of these rates applied to households of size one during the initial survey 
month. The non-response rate of least magnitude in 1980 was 1.54%. This 
applied to households containing five or more members during the third month 
in the survey; however, a non-response rate of 1.59% also applied to house- 
holds containing five or more members for month 6. In 1981, the non-response 
rate of least magnitude was 1.37%. This occurred in households having five or 
more members during month 3, while the non-response rate for month 6 was 
1.39%. Thus, although the lowest non-response rate did not uniquely occur in 
households containing five or more members during the final survey month, the 


non-response rate for households in this cell was not significantly different. 


The distributions of household size by survey duration by response status 
indicated the potential for non-response bias in survey estimates. A non-
response adjustment which does not take into account household size will im-
plicitly compensate for non-respondent households on the basis of the distri-
bution of respondents, i.e., underestimating households of size 1 and 2 and
over-estimating households of size 3 or more. It can be seen on a distribu-
tional basis that there were substantially more households of sizes 1 and 2
among non-responding households than there were among responding households
and, of course, conversely fewer households of larger sizes (3, 4 and 5+)
among the non-responding households than among the responding households.
This discrepancy in distributions became more exaggerated when months in
sample, or rotation groups, were considered, particularly for months one and
two. After month two, the non-response rate tended to stabilize for house- 
holds of size greater than two, whereas for households of size 1 or 2, the 
non-response rate continued to vary over the survey lifetime. This suggested 
that household size and rotation number are important characteristics to 


consider when methods for non-response adjustment are being evaluated.


4.4 Family Composition of Household


In Section 4.2 there were substantial differences in the distribution of 
households by size between respondent and non-respondent households. To fur- 
ther evaluate household size discrepancies between respondent and non- 
respondent households, tabulations of households in terms of their composition 
of family types were obtained. The family type compositions were based on the 
number of economic families in the household, the size of the family units, 
the presence of children, and the marital status and age of the head of the 
family unit. The specific variables are indicated in Table 4a with corre- 
sponding percentage distributions and non-response rates by type by response 


status in Table 4b. 


The higher non-response rates for households of size one were again evident 
from these tabulations. The rates were particularly high for households con-
taining only an unattached individual less than 65 years of age. House-
holds containing a married couple with other members present in the household
(children or non-children), i.e., codes 6, 7 and 8, had low non-response rates
relative to other types of households. In other words, there were proportion- 
ately more of these types of households among the responding than among the 
non-responding households. Households containing only unattached individuals 
(either one or more) and households containing a married couple only formed a 
higher percentage of non-responding households than of responding households. 
Thus in addition to household size, the composition of the household in terms 
of family types appeared to have some influence on the rate of non-response. 
Thus certain types of family units may not be properly compensated for in 
various weight adjustment strategies for non-response. This is a particularly
crucial issue in the production of family estimates.


4.5 Age of Individuals 


Although the unit of potential response is generally the household, Table 5 
presents percentage distributions by age group and response status at the in-
dividual level. Also presented are the distributions of the non-respondents
as percentages of the total population; these could be referred to as
individual level non-response rates.


The rates of non-response for all individuals combined were 3.13% and 2.63% for
1980 and 1981 respectively. These rates corresponded to household level non- 
response rates of 4.02% and 3.43% respectively for 1980 and 1981. The lower 
rates at the individual level were indicative of the inverse relationship 
between the size of household and the level of non-response as pointed out in 
Section 4.2. Since larger households had lower non-response rates, a greater 
proportion of individuals fell into the responding category. The relation-
ships on a distributional basis between individual respondents and non-resp-
ondents bore out the results of the previous section with respect to the
generally lower household non-response rates in households which contained
children. For the age groups 0-14 and 15-19, the non-response rates in 1980
were 2.50% and 2.42% respectively, while in 1981 they were 2.17% and 1.92%.
The highest non-response rates were observed in the age groups 65+ and 20-24.
This again reflected the inverse relationship between household size and the
non-response rate. Households of size 1 and 2 had the highest non-response
rates. Individuals within the age groups 65+ and 20-24 were more likely to
live alone or as a couple; hence, the non-response rates for these individuals 
were expected to be high. The variation in non-response rates by individual 
age groups indicates a potential effect on the quality of survey based esti- 
mates. In particular, age groups with a lower non-response rate than the 
overall individual non-response rate will be over-estimated by a weight
adjustment factor which does not take into account age variables. The oppo-
site occurs when the non-response rate for the age group is greater than the
overall individual non-response rate. To some extent any distortions 
introduced at the provincial level are corrected by the application of the 


ratio adjustment procedure. 
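

The link between the household level and individual level rates quoted above can be checked with a short calculation. The sketch below assumes that the number of individuals in each response category is proportional to the household share multiplied by the average household size reported in Section 4.2. The function name is illustrative only.

```python
def individual_nonresponse_rate(household_rate, avg_size_resp, avg_size_nonresp):
    """Individual level non-response rate implied by household level figures."""
    nonresp_persons = household_rate * avg_size_nonresp
    resp_persons = (1.0 - household_rate) * avg_size_resp
    return nonresp_persons / (nonresp_persons + resp_persons)


# year -> (household non-response rate, average size of respondent
#          households, average size of non-respondent households)
figures = {1980: (0.0402, 2.93, 2.26), 1981: (0.0343, 2.88, 2.19)}

for year, args in figures.items():
    print(year, round(100 * individual_nonresponse_rate(*args), 2))
```

The implied rates, 3.13% for 1980 and 2.63% for 1981, agree with the individual level rates given in the text, which is consistent with the difference between the two levels arising from the smaller average size of non-respondent households.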
4.6 Age of Individuals by Size of Households 
Continuing from the previous section the distributions of individuals by age 


groupings and response status were obtained within various household size 


breakdowns. These distributions, as well as non-response rates, are presented
in Table 6a based on 1980 annual averages and Table 6b based on 1981 annual


averages. 


The distributions of individuals by age group were relatively similar by 
household size between respondents and non-respondents in households of sizes 
2, 3, 4 and 5+; however, for households of size 1 there were substantial dif- 
ferences in the distributions. Within size 1 households the primary differ- 
ences were for age group 25-44 in which there were substantially more indiv- 
iduals (on a distributional basis) in non-responding than responding house- 
holds (39.6% compared with 28.8% for 1981 and 35.5% compared with 27.9% for 
1980) and for age group 65+ in which there were substantially fewer individ- 
uals in non-responding households than in responding households (22.3% com- 
pared with 34.3% for 1981 and 22.4% compared with 34.3% for 1980). This 
latter observation is particularly important as about 28% of the population 
65+ reside in households of size 1 whereas less than 5% of individuals in the 
age group 25-44 reside in households of size 1. Thus, it is differences in 
the distributions by age groups between respondents and non-respondents which 
merit special attention in any procedures to compensate for non-response in 


households of size 1. 


The non-response rates in Tables 6a and 6b show that individual non-response 
rates within age groups exhibit the same pattern across household size 
measures as was observed in Section 4.2, namely that non-response rates 
decrease as household size increases. Within a particular size of household 
the relationships of non-response rates by age group were very different from
non-response rates by age groups for all household sizes combined. Perhaps 
most notable was the fact that for each household size group separately 
(except size 4 in 1980), individuals 65+ exhibited the lowest level of non- 
response whereas the non-response rate for individuals 65+ in all households 
combined was the largest of any age group. This phenomenon resulted from the 
fact (mentioned earlier in this section) that the majority of individuals of 
age 65+ live in households of size 1 or 2, where the non-response rate was the 


greatest.


These tables indicate that non-response is very much dependent on household
size and that age is not an important factor apart from the fact that there is 
a relationship between household size and the age of individuals residing in 
the household. 
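

The composition effect described in this section, in which an age group has the lowest non-response rate within each household size yet the highest rate overall, can be illustrated with a small weighted-average calculation. All rates below are hypothetical, as are the household-size shares other than the approximate 28% and 5% of individuals living in households of size 1 quoted above. Household sizes of 3 or more are collapsed into one class for brevity.

```python
# Hypothetical individual non-response rates by age group and household size
rates = {
    "25-44": {"1": 0.080, "2": 0.040, "3+": 0.020},
    "65+":   {"1": 0.060, "2": 0.035, "3+": 0.015},
}

# Share of each age group living in households of each size.
# Only the size 1 shares follow the text; the rest are assumed.
shares = {
    "25-44": {"1": 0.05, "2": 0.20, "3+": 0.75},
    "65+":   {"1": 0.28, "2": 0.55, "3+": 0.17},
}


def overall_rate(age_group):
    """Rate over all household sizes: size-specific rates weighted by shares."""
    return sum(rates[age_group][size] * shares[age_group][size]
               for size in rates[age_group])


overall = {age_group: overall_rate(age_group) for age_group in rates}
for age_group, rate in overall.items():
    print(age_group, round(100 * rate, 2))
```

In this illustration the 65+ group has the lower rate within every household size, but the higher rate overall, because a much larger share of the group lives in small households.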


4.7 Age of Individuals by Month in Survey


The distributions of individuals by age group for varying numbers of months in
the survey, separately for respondents and non-respondents, are presented in 
Tables 7a and 7b for 1980 and 1981 respectively, as well as the corresponding 


non-response rates. 


From Tables 7a and 7b it can be noted that distributions by age group for 
respondents were virtually identical regardless of the number of months in 
sample. Although the distributions for non-respondents showed a higher degree
of variability for differing months in sample, there remained a degree of
stability in the distributions. The pattern between distributions for resp-
ondents and for non-respondents was similar for each month in sample breakdown


as it was for totals across months in sample. 


A study of individual non-response rates again indicated in general a 
decreasing trend as number of months in sample increased. This occurred for 
individual age groups as well as for the total population. As expected the 
pattern over time was not as pronounced for individuals as it was on a house- 
hold basis. This can be attributed to changes in the response pattern for 
various sized households; that is, there is a tendency for larger sized house- 
holds to become non-respondents in the later survey months while smaller sized 


households tend to become respondent (refer to Table 3). 
4.8 Labour Force Status 


In this subsection attention is turned from the basic demographic charac- 
teristics of households by response status to the characteristics of labour 
force activity. This evaluation was motivated by the desire to assess poten-
tial non-response bias in the survey estimates of these characteristics.


Section 4.2 presented substantial differences in the distributions of respond-
ent and non-respondent households by household size, while Section 4.1
presented similar findings for month in sample. For this reason, the distri- 
butions of individuals by labour force status within each category defined by 
household size, month in sample, and response status were examined. They are 


presented in Table 8a. 


Examination of these distributions by labour force status for all individuals,
regardless of size of household, showed that the distributions for respondent
households differed in some important ways from the distributions of non-respon-
dents and the pattern of differences was not consistent over time. The per-
centages of individuals unemployed showed perhaps the most interesting
changes. For respondents, this percentage was relatively constant for each 
number of months in the sample; whereas, for non-respondent households, there 
was an increase in the percentage of individuals unemployed as the number of 
months in sample increased. The percentage of the population (aged 15 and 
over) unemployed for respondent households ranged from a low of 4.7% in months 
3 to 6 to a high of 5.0% in month 1 for 1980, and a low of 4.6% in months 4 
and 5 to a high of 4.9% in month 1 for 1981. For non-respondent households, 
the corresponding range of percentages was 4.5% in month 1 to 6.4% in month 6 
for 1980, and a low of 4.0% in month 1 to a high of 6.2% in month 5 for 1981. 
A comparison of the percentage unemployed for each response status over time 
shows that there were fewer unemployed persons among non-respondent than res- 
pondent households for households in the sample for the first occasion and 
more unemployed persons among non-respondent than respondent households for 
households in the sample for four to six months. The relationship was 
variable for months two and three. A comparison of the percentage distribu- 
tion patterns of labour force activities for respondent households over time 
indicated a relatively stationary distribution; however, the pattern for non- 
respondent households varied. For non-respondent households there were 
greater fluctuations in the percentage distributions for each labour force 
status across months. No distinct pattern of change was exhibited except with 
unemployment where representation increased with survey duration. This varia- 
tion among non-respondents was at least partly attributable to small sample 


sizes of non-respondents relative to sample sizes for respondents.


Since unemployment is more sensitive to sample fluctuations than the other
labour force statuses and exhibits a definite trend over time, compensating 
for non-response over rotation groups would distort this characteristic. 
Adjusting over rotation groups would result in an overestimation of unemploy- 
ment in month 1, and an underestimation of unemployment in months 4 to 6. 
Since the divergence between responding and non-responding households in the 
percentage distribution of unemployment was more pronounced in the later 
survey months, the overall effect would be an underestimation of unemploy- 
ment. Since the non-response adjustment occurs at the household level, not at 
the individual level, and the size of the household has proven to be an 
important response determinant (see Section 4.2), it is essential to consider 
household size as an additional component for the evaluation of non-response 


with respect to the labour force status. 
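

The direction of the distortion described above can be illustrated numerically. The sketch below makes a simplifying assumption: the adjustment is treated as representing non-respondents by respondents, so that the adjusted percentage unemployed equals the respondent percentage. The percentages unemployed are the 1980 figures quoted earlier in this section for months 1 and 6; the non-response rates are hypothetical.

```python
def respondent_based_bias(nonresponse_rate, pct_respondents, pct_nonrespondents):
    """Bias, in percentage points, of using the respondent percentage alone."""
    true_pct = ((1.0 - nonresponse_rate) * pct_respondents
                + nonresponse_rate * pct_nonrespondents)
    return pct_respondents - true_pct


# month in sample -> (hypothetical non-response rate,
#                     % unemployed among respondents,
#                     % unemployed among non-respondents)
cases = {1: (0.08, 5.0, 4.5), 6: (0.03, 4.7, 6.4)}

bias = {month: respondent_based_bias(*args) for month, args in cases.items()}
for month, value in bias.items():
    print(month, round(value, 3))
```

A positive value indicates overestimation (month 1) and a negative value underestimation (month 6), in line with the directions stated in the text. The magnitudes depend entirely on the assumed non-response rates.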


When distributions by labour force status and month in sample were examined by 
household size breakdowns, the patterns or relationships noted above did not 
hold. For households of size 1, the proportions of individuals employed and 
unemployed were substantially higher for non-respondents than for respon- 
dents. For respondents the proportion of individuals employed and the propor- 
tion unemployed were relatively constant for varying number of months in the 
sample. For non-responding households, there was a general decrease in the
proportion of individuals employed as the duration in sample increased; 
whereas, there was a substantial increase in the proportion of unemployed as 


the number of months in sample increased. 


For households of other sizes (2,3,4 and 5+), the differences between labour 
force status distributions for respondent and non-respondent households were 
much smaller. Also patterns between distributions for respondents and non-
respondents were not nearly as strong or consistent as for the case of house-
holds of size 1. On a distributional basis, there were generally fewer un-
employed individuals in non-respondent households for the first survey
occasion and more unemployed individuals in non-respondent households for the 
fourth and subsequent months in the sample, than for responding households. 


For households in the survey for two or three months the pattern was variable.


The percentage of individuals "not in the labour force" differed between
responding and non-responding households by household size. In households of 
size 1 and 2 there were fewer individuals "not in the labour force" in non- 
responding households than in responding; whereas, no definite pattern existed 
for households of size 3 or more. As the employed constituted the majority of 
the group "in labour force", generally the relationship on a distributional 
basis between respondent and non-respondent households was the complement of 


that noted for the characteristic "not in the labour force". 


Table 8b presents unemployment rates by household size and month in sample by 
response status for 1980 and 1981 respectively. These results are related to 
those in the previous tables and observations may be similar in that the 
relationships between unemployment rates for respondents and non-respondents
are the result of the relationships between proportions employed and unemploy-
ed between respondent and non-respondent units.


For all individuals (i.e., regardless of household size) the rate of unemploy- 
ment for non-respondents was less than the rate for respondents for the first 
month and greater than the rate for respondents in months 4 to 6. The rela- 
tionship between the rates for months 2 and 3 varied by year. For non- 
responding households, there was a substantial increase in the unemployment 
rate as the number of months in the survey increased. This phenomenon was not 
observed for respondents where the first month in sample had the highest rate 


but the pattern for subsequent months was somewhat variable. 


For households of sizes 2, 3, 4, and 5+ the same general relationship in un- 
employment rates between respondent and non-respondent households was observed 
as for the full set of individuals (i.e., regardless of household size). 
There was no definite pattern in unemployment rates over time for non- 
respondent households when various household sizes were considered. For 
households of size 1 the unemployment rate for non-respondents was generally 


higher than the rate for respondents. 




4.9 Type of Area 


Results presented in Section 4.3 showed that there were substantial differ- 
ences in distributions of households by size and month in sample between res- 
ponding and non-responding households. This section further examines these 
results within broad types of area determined generally on the basis of popu-
lation concentration and density; namely, self-representing areas (SRU),
non-self-representing urban areas (NSRU urban), and non-self-representing
rural areas (NSRU rural). Although a more precise definition of area types is
available, for this study it is sufficient to note that SRU's consist of the 
larger cities in the country, NSRU urban areas consist of smaller cities and 
towns, and NSRU rural areas are composed of the more sparsely populated 
portions of the country, including small villages and farm land. Due to the 
very small sample sizes, special areas were not considered. In very general 
terms, the patterns observed in Section 4.3 for all area types combined, were 
similar to those observed for the three broad area types; however, there were
different distributions by household size for respondents depending on type of 
area. In SRU areas, on a distributional basis, most households were smaller 
sized whereas there were fewer smaller sized households in NSRU rural areas. 
The opposite was observed for larger sized households. The relationship 
between respondent and non-respondent households, however, was relatively the 
same regardless of type of area. From Tables 9a and 9b it can be noticed that
there were approximately twice as many households of size 1 in non-responding 
households as in responding households and approximately one-half as many 


larger sized households (5+) in non-responding as in responding households. 


Non-response rates, although levels differ by type of area, showed the same 
pattern of decreases by number of months in sample as was observed for all 
units combined (i.e., as compared with results presented in Section 4.3). 
Again there were substantial decreases in levels of non-response between the 
first and second months with decreases of lesser magnitude occurring in
subsequent months.


The rates of non-response for all households (i.e., regardless of household
size) were the highest for SRU areas, followed by NSRU urban areas and were
the lowest for NSRU rural areas. These differences were a function of the
distributions of households by size across area types. Within specific size 
of household groupings, the patterns between respondent and non-respondent 
households are generally the same as when examined for comparable size group- 
ings for all area types combined. The type of area variable is an important 
factor in compensation procedures as it differentiates between areas with 
different levels of non-response. However, once size of household and month
in sample are taken into account, the type of area variable does not provide
much additional information in the characterization of survey units by
response status.


5. SUMMARY 


The previous section presented characterizations of responding and non-res- 
ponding households with respect to a wide range of variables. The households 
and/or individuals displayed somewhat different characterizations depending on 
their response status. On the assumption that responding and non-responding 
households exhibit similar characteristics, it would seem to be important to 
incorporate some of the variables examined in Section 4 into non-response
compensation procedures for the survey.


The method of compensating for non-respondent households in the LFS is carried 
out within small geographic areas (balancing units) by an inflation of the 
design weight by the inverse of the household response rate. These adjust- 
ments are made on the basis of household counts independent of any charac- 
teristics of the household. Unless there is a high degree of correlation 
among households within balancing units, one would expect very little 


reduction in non-response bias by the present adjustment procedure. 
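
The adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LFS production procedure: the record layout (`unit`, `design_weight`, `responded`) and the function name `adjust_weights` are hypothetical, and each balancing unit is assumed to contain at least one responding household.

```python
from collections import defaultdict


def adjust_weights(households):
    """Inflate design weights of respondents by the inverse of the
    household response rate within each balancing unit.

    `households` is a list of dicts with hypothetical keys:
    'unit' (balancing unit), 'design_weight', 'responded' (bool).
    """
    sampled = defaultdict(int)
    responded = defaultdict(int)
    for h in households:
        sampled[h["unit"]] += 1
        if h["responded"]:
            responded[h["unit"]] += 1

    adjusted = []
    for h in households:
        if not h["responded"]:
            continue  # non-respondents carry no weight after adjustment
        # Inverse of the response rate = sampled count / responding count.
        factor = sampled[h["unit"]] / responded[h["unit"]]
        adjusted.append({**h, "weight": h["design_weight"] * factor})
    return adjusted


# Made-up sample: two balancing units with different response rates.
sample = [
    {"unit": "A", "design_weight": 100.0, "responded": True},
    {"unit": "A", "design_weight": 100.0, "responded": True},
    {"unit": "A", "design_weight": 100.0, "responded": True},
    {"unit": "A", "design_weight": 100.0, "responded": False},
    {"unit": "B", "design_weight": 200.0, "responded": True},
    {"unit": "B", "design_weight": 200.0, "responded": False},
]
result = adjust_weights(sample)
```

Because the factor depends only on household counts, every respondent in a balancing unit receives the same inflation regardless of household size or month in sample, which is the limitation discussed in this section.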


An indication of the magnitude of non-response bias under the current pro- 
cedure for compensation for non-response would be desirable. An explicit im- 
putation of missing information due to non-response on the LFS file can be 
obtained using procedures similar to those used in this study. After
adjustments for complete non-response (i.e., non-response for all six months),
survey estimates based on these comprehensive imputation strategies can be
obtained. Comparison of these resulting estimates with official survey
estimates would provide added support to assessments of response bias which
have been alluded to in this report.


This report has provided justifications for considering various additional 
variables in the adjustment for non-response: month in sample, household size 
and labour force status. As there are substantial variations in the response 
rate by rotation number (month in sample) it is advisable to adjust for non- 
response within each rotation number separately. As the pattern of labour 
force characteristics for non-respondents exhibits a degree of variation over 
months in sample, an adjustment on the basis of rotation number should have 
some benefits for labour force estimates as well. As the greatest differences 
are between the first month and subsequent months in sample, an adjustment for 


these two classes may be sufficient. 


Among the non-responding households there are substantially more households of 
size one (and to a lesser extent for size two) than in responding households. 
Thus, household size is an important variable to be incorporated in any 
adjustment procedures for non-response. The analysis has shown that
discrepancies are the greatest for households of size one. It may thus be 
feasible to consider adjustments for two groups of households only, namely 
households of size one and households of size two or more. Incorporation of 
household size into compensation procedures for household non-response 
necessitates having some information available about the size of 
non-responding households. This may be explicit, as for example the household 
size on a previous survey occasion, or implicit, as for example a distribution
of non-responding households by size from previous surveys, or a distribution
by household size from an independent source such as the Census. In either
situation, adjustments incorporating considerations of household size, in
conjunction with adjustments by rotation number, should do much to alleviate
discrepancies by rotation number in sample survey estimates of household and
economic family units.
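
A minimal sketch of the two-by-two adjustment classes suggested above (household size one versus two or more, first month versus later months in sample) might look as follows. The function names and the tuple layout are hypothetical, and the sketch assumes that a household size is available for every non-responding household, for example from a previous survey occasion.

```python
from collections import Counter


def adjustment_cell(household_size, month_in_sample):
    """Map a household to one of four hypothetical adjustment classes."""
    size_class = "size 1" if household_size == 1 else "size 2+"
    month_class = "month 1" if month_in_sample == 1 else "months 2-6"
    return (size_class, month_class)


def cell_factors(households):
    """Return the inverse response rate for each adjustment class.

    `households` is a list of (household_size, month_in_sample, responded)
    tuples; classes without any respondent are left out.
    """
    sampled = Counter()
    responded = Counter()
    for size, month, resp in households:
        cell = adjustment_cell(size, month)
        sampled[cell] += 1
        if resp:
            responded[cell] += 1
    return {c: sampled[c] / responded[c] for c in sampled if responded[c] > 0}


# Made-up records: (household size, month in sample, responded).
households = [
    (1, 1, True), (1, 1, False), (1, 3, True),
    (3, 1, True), (3, 1, True),
    (4, 2, True), (2, 5, False), (5, 6, True),
]
factors = cell_factors(households)
```

In practice such factors would be applied within balancing units or some coarser area, a choice the text leaves open.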


As noted in Subsection 4.9, even within household size and month in sample,
there are differences in the distributions of respondents and non-respondents
by labour force status. For the LFS there may be advantages in utilizing some
variables relating to labour force activities in the adjustment process. 
There are two factors which tend to preclude this from being viable in
practice. First, there is a desire for a general weight adjustment, not only
for the LFS but also for the various supplementary surveys. Second,
information at this level of disaggregation would be very unstable and would
necessitate adjustments at higher levels of aggregation. This new level of
adjustment would negate any advantages which may currently be experienced due
to local labour market phenomena. Any compensation procedures must bear in mind
the relatively low level of non-response currently experienced for the LFS.
This has implications for the level of sophistication warranted, the potential
for impact on the estimates, and the reliability of non-response information
which would form a key part of the procedure.


There is a range of possible alternatives to the present method of
compensating for non-response. Further work in the development of other feasible
compensation strategies is a two-stage process. The first stage is the
simulation and evaluation of monthly labour force estimates based on the
imputation strategy suggested in this report. The second stage is the development
of other non-response adjustment strategies followed by their empirical
evaluation. Such work is in fact under way.


FOOTNOTES 


[1] The estimates provided by the Labour Force Survey refer to the specific 
week covered by the survey each month, Reference Week, normally the week 
containing the 15th day. Survey Week, when all interviews are conducted, 


is the week immediately following Reference Week. 


[2] The survey universe for the Labour Force Survey is all persons in the pop- 
ulation aged 15 years of age or over residing in Canada, with the 
exception of the following: residents of the Yukon and the Northwest 
Territories, persons living on Indian Reserves, inmates of institutions 


and full-time members of the Armed Forces. 


[3] Each month the interviewer is required to indicate whether a complete
interview was obtained, that is, a complete Labour Force Survey questionnaire
was completed for each eligible household member; a partial
interview was obtained, that is, a questionnaire was completed for some but
not all eligible household members; or no interview was obtained. When
no interview occurs, the interviewer must indicate the reason for this.
Non-respondent households include those where no one was home (after
several calls), the household refused to respond, the household was
temporarily absent, or the interview was prevented by weather conditions,
death, sickness, a language problem or other unusual circumstances in the
household. Vacant dwellings include unoccupied dwellings, seasonal
dwellings, dwellings under construction, dwellings occupied by persons not
to be interviewed, and dwellings demolished, converted to business
premises, moved, abandoned (unfit for habitation), or listed in error.


[4] For further detail on the LFS non-response adjustment see "Methodology of
the Canadian Labour Force Survey, (1976)", Statistics Canada, Catalogue
71-526 Occasional, October 1977, pp. 67-68.


[5] For further detail on the LFS weighting process see "Methodology of the
Canadian Labour Force Survey, (1976)", Statistics Canada, Catalogue 71-526
Occasional, October 1977, pp. 65-74.

TABLE 1. Percentage Distributions for Respondent and Non-respondent Households
by Month in Sample for 1980 and 1981, Canada


Month in     Total    Respondent    Non-respondent    Non-response
sample                                                rate

1980

1            16.6     16.1          28.6              6.94
2            16.6     16.7          15.9              3.84
3            16.7     16.8          14.4              3.47
4            16.7     16.8          14.3              3.45
5            16.7     16.8          14.2              3.42
6            16.8     16.9          12.6              3.03
Total        100      100           100               4.02

1981

1            16.6     16.0          [illegible]       6.66
2            16.7     16.7          16.6              3.42
3            16.7     16.8          14.4              2.96
4            16.7     16.8          13.9              2.83
5            16.7     16.9          12.1              2.48
6            16.7     16.9          [illegible]       [illegible]
Total        100      100           100               3.43

TABLE 4a. Determination of Family Type Composition Variable


| 


Code Number of Size of Age of Presence Head is a 
economic economic head of OT member of 
family units Family unit family unit children a married 
in the household in the couple 

household 

es | a ee 

1 1 1 Zo 

2 1 1 25-64 

3 1 1 65+ 

4 1 2 45 No Yes 

> 1 UZ 45 No Yes 

5 1 2+ 45 Yes Yes 

7 1 2+ 45 Yes Yes 

j 1 2+ No Yes 

) 1 2+ No No 

10 1 2+ Yes No 

1 2+ all of,size 1 
ys 2+ all of size 2+ 

'3 2+ mixed 


TABLE 4b. Percentage Distribution of Respondent and Non-respondent Households by
Economic Family Type for 1980 and 1981 Annual Average, Canada


| 1980 1981 
conomic Non- Non- 
amily Non-respondent Respondent response) Non-respondent Respondent response 
ype households households’ rate households households’ rate 
| 5.8 2.3 9.80 bat) 2.4 7.82 
21.0 9.0 8.51 23:64 239 7.71 
8.6 6.6 5.14 ed 7.0 4.47 
9.5 8.0 4.72 e) 8.0 4.05 
| Died 16 4.57 14.4 13.8 Suk 
| 18.4 28 «il 2-67 16.3 2720 2.10 
| 4.9 9.9 2.04 el 9.0 1.83 
| 4.6 8.2 LA oy 8 4.1 8.6 1.65 
| 3.1 4.3 Ze? al 4.3 2-45 
) by) 4.8 3.30 > 4.9 3.47 
| Dak 27 > 05 BaD) 2.8 4.25 
P 0.0 0.1 Ve25 0.1 0.1 Fare) 
y 104 2.1 2.47 1.2 2.1 2.06 
tal 100.0 100 .0 100.0 100.0 

TABLE 6a. Percentage Distribution of Individuals by Age Group and Non-response
Rates for Household Size and Response Status for 1980 Annual 
Averages, Canada 


Household size 


group a Ee 
1 2 3 4 5+ Total 

i 
Respondent 

0-14 0.0 2.6 4,8) JD:0 | 37 «1 24.3 
15-19 69 SID BZ S) ei We? S37) 
20-24 10.5 14.2 11.4 6.0 6.7 9A 
25-44 219 Za. D Bilge 35.4 2% 2 2IZ 
45-64 254 31.6 25.3 12.4 11.8 18.9 
65+ 34.3 22.8 D2 1.3 2.6 8.7 
Total 100.0 100.0 100.0 100.0 100.0 100.0 
Non-respondent 

0-14 0.3 2.6 22.6 37.0 40.8 19.4 
15-19 3.0 3.8 8.1 8.8 De TS 
20-24 13.6 14.4 12.0 4.8 Die. 7) 10.4 
25-44 355 PB: 33.5 37.4 26-6 31.6 
45-64 Do. 2 33.0 19.6 10.7 10.2 20.9 
65+ 22.4 19.1 4.1 1.4 1.1 10.2 
Total 100.0 100.0 100.0 100.0 100.0 100.0 
Non-response rates 

0-14 - 4.52 P30 Dao 2.05 2/50 
15-19 11.07 4.83 2.95 Zo22 Veg6 2.42 
20-24 9.49 7.94 3.16 1.95 1.60 5.54 
25-44 2255 4.83 Be2A 2.58 1.98 3.38 
45-64 7.45 4.68 Zaae Zeist 1.62 3.45 
65+ 5.01 3.80 2.41 2-47 0.85 3.65 


Total 7.48 4.50 3.00 2.44 1.87 Dr AD 

TABLE 6b. Percentage Distribution of Individuals and Non-response Rates by Age
Group for Household Size and Response Status for 1981 Annual 
Averages, Canada 


Household size 


Age 
group 
1 Z 3 4 5+ Total 

Respondent 

0-14 0.0 Zl 19.8 34.4 Wein) 23.6 
15-19 1.8 3.4 8.6 269 16.1 9.5 
20-24 10.6 14.1 11.4 6.4 7.1 9.4 
25-44 28.8 Layee) 31.8 be ry4 25.8 29 «6 
45-64 24.4 ls 7 25.62 12.6 11.6 19.4 
65+ 34.3 VDL 50 ios Lied 8.9 
Total 100.0 100.0 100.0 100.0 100.0 100.0 
Non-respondent 

0-14 0.1 Died 23.6 Di» 2 39.4 pier 
15-19 Z.0 Dal TD 8.8 15.6 6.9 
20-24 13.0 15 <1 ZO Se) Deu 10.9 
25-44 39 .6 28 .6 34.3 DEloe: fades! 33.0 
45-64 27.09 30.1 19.8 10.1 10.2 19.9 
65+ 22.2 WS) 2.8 0.9 eZ 10.5 
Total 100.0 100.0 100.0 100.0 100.0 100.0 


Non-response rates 


0-14 - 516 Lele 2.05 esa: 2 AZ 
15-19 TE 3.4] 2.05 1.69 1.61 T.92 
20-24 Tag 4.02 2-47 1.77 1.37 3.04 
25-44 8.85 4.14 Zed 2.01 1.78 2.91 
45-64 6.20 3.5/7 | Rae) ee) 1.46 2.74 
65+ 4.38 Dez? es Ved 0.83 3.08 
Total 6.58 3.16 Lie ID 120 1.66 2.63 

TABLE 7a. Percentage Distribution of Individuals and Non-response Rates by Age
Group for Month in Sample and Response Status for 1980 Annual 
Averages, Canada 


| a 


Month in sample 


1 2 3 4 5, 6 Total 
SR a a 
Respondent 
0-14 24.2 24.1 Z4 oD 24.4 24.5 24.5 24.3 
15-19 Qo9 9.8 Sed) Boll ST 9.6 Shed 
20-24 9.2 2 oh Dae 4 9.1 9.0 Pel 
25-44 Zo oI ZI 27 «2 LI 2D LP «LZ 29.1 29.2 
45-64 18.9 10.9 10.59 {jist 18.9 19.0 18.9 
65+ 8.7 8.8 8.7 8.7 8.7 8.7 8.7 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 


Non-respondent 


0-14 18.9 19.0 19.6 19 <2 20.0 19.9 19 <4 
15-19 7.6 6.3 7.8 View 7.6 129 eee 
20-24 10.6 10.3 VGH 10.3 10.5 10.3 10.4 
25-44 2 Pe 55.1 30.8 30.6 31.4 31.5 31.6 
45-64 20.4 20.6 21.8 21.5 20.9 Zee 20.9 
65+ 10.6 10.8 10.0 10.4 9.6 IE cUEZ 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 
Non-response rates 

0-14 4.22 ZeLD 2.18 2.16 2.24 2.00 2.50 
15-19 4.12 1.84 7 les) 2.16 Zio Ad, 2.03 2-42 
20-24 6.12 3.16 2.94 3.04 3.16 2.78 354 
25-44 5 «83 Boca 2 83 2.81 2.93 2.65 3.38 
45-64 Del 3.08 dO 3.06 3.00 Dish | 3.45 
65+ 6.40 3.00 3.06 3.22 3.01 2.69 3.65 
Total 534 2.84 2.69 2.70 Led 2.46 ob) 

TABLE 7b. Percentage Distribution of Individuals and Non-response Rates by Age
Group for Month in Sample and Response Status for 1981 Annual 
Averages, Canada 


i 


Month in sample 


Age 

group 

1 Z B 4 5 6 Total 

Respondent 

0-14 23.4 23.4 2oRS2) 23.6 ZT 23.8 23.6 
15-19 9.1 9.6 Ie? Di 9.4 9.3 9.5 
20-24 9.4 9.4 9.4 9.4 9.4 202 9.4 
25-44 29 oD Zo 37 AAT 29.6 29 .6 29.6 29.6 
45-64 1p <1 lol 19.0 19.1 19.1 19.1 196A 
65+ 8.9 8.9 8.9 8.9 8.9 8.9 8.9 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 


Non-respondent 


0-14 18.1 18.7 18.3 19.8 19.8 20.1 18.9 
15-19 6.9 6.2 6.8 6.6 7.3 76 6.9 
20-24 10.9 10.5 ipl 11.0 11.2 10.7 10.9 
25-44 33.5 Del. 32.0 32.6 BD <i 32.6 33.0 
45-64 20.2 20.5 20.8 19.2 18.9 19.0 1929 
65+ 10.4 11.0 11.0 10.8 9.8 Dred 10.5 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 
Non-response rates 

0-14 3.96 2.03 Aesth 1.85 1.64 1.57 ZeAZ 
15-19 3.6/7 1.66 lee] 1.55 1.54 1.50 Vez 
20-24 By o)) 2.80 Ze 2.58 2.34 ZeAZ 3.04 
25-44 5 .6/ Loe 2.35 2.42 2-20 2.03 209" 
45-64 Dee Ze el ews) Zee | 1.94 1.86 2.74 
65+ 5.88 3.10 2.68 2.66 72P i) 2.01 3.08 
Total 505 Ze ID 2.18 2.20 1.96 1.85 2.63 

TABLE 8b. Unemployment Rates by Household Size, for Month in Sample and
Response Status for 1980 and 1981 Annual Averages, Canada 


Month in sample 


Household Response 1 2 3 4 5 
size status 
1980 
1 Respondent Dee Dro Lh Doh 5 B35 Ae), 
Non-respondent 6.56 era 10.16 9.44 BA wae 
2 Respondent 6.94 6.95 6.46 6.54 7.01 
Non-respondent 6.59 6.81 7.58 8.7/ Tet 
b Respondent 8.14 7.76 hed A 7.71 8.04 
Non-respondent (omen he: NOS) Del a Wat A 8379 
4 Respondent 7.94 7.24 6.595 6.84 6.71 
Non-respondent 6.71 742 10.09 Uetz 9.05 
5+ Respondent ere al IAG 8.88 8.54 ‘ok its 
Non-respondent Be7z T649 6.50 Fat? 12.85 
Total Respondent 7.83 7.63 Pot] 7.28 Drage 
Non-respondent Ge0Z WASS G45 8.60 Pad 
1981 
1 Respondent 6.56 5.86 6.22 5.8/7 6.16 
Non-respondent Dee 6.07 Dy 9.46 Tao, 
Z Respondent 6.27 6.38 6.01 6.13 vane 
Non-respondent oy aey| 6.995 Tea ks Aree 9.06 
3) Respondent Tho Lead Taste vee VG 
Non-respondent 6.76 7.46 etl 2 BregliZ 102335 
4 Respondent 7-00 eZ TREAS 7.02 6.07 
Non-respondent 6.28 6.05 5.88 8.07 9.07 
5+ Respondent 2.24 8.90 8.78 8.5/7 8.26 
Non-respondent 7 = 16 Go. 2G Sez, Gree 10.84 
Total Respondent ASE eo? 7.28 7.14 7.16 
Non-respondent 6.05 6.89 6393 8. 30 9.36 

TABLE 9c. Household Non-response Rates by Type of Area, Household Size, and
Month in Sample for 1980 and 1981 Annual Averages, Canada 


ee 


Type of area 


1980 1981 
Month Household 
in size SRU NSRU NSRU SRU NSRU NSRU 
sample urban rural urban rural 
1 1 do79 11.02 11.62 (Bs) hare Pet 
2 7.82 ise A 7.05 e998 6.84 5.90 
3 4.98 6.05 > «03 4.93 562 502 
4 4.03 4.48 4.15 5.59 3.76 345 
5+ 3213 4.01 5.45 3.21 3.10 2.61 
Total Bby?2 6.65 Dell WEies, 6.42 4.95 
2 1 8.21 6079 7.20 6.30 6252 6.56 
2 4.30 4.41 4.13 BARNS 3.77 Bey) 
3 2-44 312 2.82 bed Dyas 2.69 
4 1.90 2.48 1.96 A 4 2.56 aA 
5+ 1.63 ye) le 4 1.95 1 5O7 1.36 
Total 3.98 Biegs Je 3.26 be] 3.50 3.01 
3 1 6.68 6.3535 6.50 655) 561 eZ 
Z 3.83 3.67 3.88 3.44 3.20 2.81 
3 2.48 3.18 2.78 1.58 ‘1552 2.10 
4 2.08 2.60 Zotd, 1.46 1.21 195 
5+ Too T6081 1.87 Teo 2.16 lou 
Total De5S 3.60 3.20 Baie) 2.81 Zool 
4 1 6.41 6.58 6.04 5.50 526 5 02 
2 4.00 3.60 Sa | 3.06 yet 2.90 
5) 2.44 2.71 2.91 IANS) 1.99 1.3 8i4 
4 2.02 2.82 2.07 Va74 1.66 duet) 
5+ 1.47 1.85 1.90 1S 59 1.23 
Total bee) 5-61 Ball 2.99 2.83 2659 
5 1 6.04 5294 551 G59 4.47 {F339 
2 4.05 erie 568 | Zeta Sn he, Za 
3 251 PMT ES) Looe 1.60 2.18 1.86 
4 ZZ 2 2.56 2.34 140 2.10 1.52 
5+ 1.41 (57/9 1.94 1.64 ile / (Pra a 
Total Died 3.50 3.14 Jeo) Ze12 2.14 
6 Die tZ 54 30 92 -48 24 

ROTATION GROUP BIAS IN THE LFS ESTIMATES¹


P.D. GHANGURDE²


The paper attempts to evaluate the impact of non-response 
adjustment by rotation groups on rotation group bias in the 
estimates from the Canadian Labour Force Survey. Results 
on bias and non-response characteristics are presented and 
discussed. An index used to measure rotation group bias is 
given and some empirical results are analyzed. 


1. INTRODUCTION 


In the Canadian Labour Force Survey (LFS) sample design each month one-sixth 
of the households rotate out of the sample and one-sixth rotate in. The 
sample is thus composed of six panels or rotation groups. In any given month 
households in a rotation group have been in the survey from one to six months, 
including the current month. It is well-known that in household surveys with 
rotation sample designs estimates for the same characteristics from different 
rotation groups could have different expected values. This phenomenon, called 
rotation group bias, has been studied for the LFS and other household surveys 


with rotation sample designs (see [1], [5], [7] and [8]). 


Rotation group bias can be attributed to several factors. In the LFS, the
non-response rates at household level are known to differ between rotation
groups, i.e., by number of months a household is in the survey. It is also known
that non-respondent households tend to have different characteristics as 
compared to respondent households. Both these factors can contribute to 


bias. Due to conditioning of the respondent or familiarity with the survey
over a period of six months, response bias in the data from successive months
can be of different magnitude. There is some evidence from the LFS reinterview
data of such differential bias over the period of six months. However,
in the literature it has also been hypothesized that rotation group biases can
be attributed to differences in non-response probabilities between rotation
groups [7]. Although individual probabilities are not known, their averages
can be estimated by non-response rates.


¹ Presented at the American Statistical Association Meeting in Cincinnati,
August 1982.

² P.D. Ghangurde, Census and Household Survey Methods Division, Statistics
Canada.


In this paper an attempt is made to evaluate the impact on rotation group bias 
of non-response adjustment by rotation groups. In section 2 some results on 
bias are introduced and their implications on the bias in the estimates from 
different rotation groups are discussed. Section 3 presents some data on 
non-response rates in the LFS and characteristics of respondents and
non-respondents by months in the survey and their contribution to rotation 
group bias. Section 4 explains the adjustment of LFS weight for non-response 
by rotation groups and its impact on the rotation group bias and an index used 
as a measure of rotation group bias. In section 5, some data on the index for
labour force status categories, based on 1981 surveys, are analyzed.

2. THE STATISTICAL MODEL 


We introduce a model which provides expressions for contribution to bias of 
differences in non-response rates, differences in characteristics of 
respondents and non-respondents and response bias for any groups of the sample 
in which adjustment of weight for non-response can be done. Rotation groups
can be considered as a particular case of these groups.


A population of size N is assumed to be divided into "strata" of respondents
and non-respondents of sizes $N_1$ and $N_2$ respectively. A simple random sample
of size n is drawn; responses are obtained from $n_1$ units and the remaining
$(n - n_1)$ units are non-respondents.


Suppose the sample can be divided into K groups such that non-response rates 
and characteristics of respondents and non-respondents differ between the 
groups. The data collection methods used in these groups and the extent of 


conditioning of respondents or their familiarity with the survey could be
different, leading to differences in non-response rates and characteristics
and also possibly to different response biases. By an extension of a result
in [2] and [6] to include a response bias component, the bias of the sample
mean $\bar{y}$ of the $n_1$ responding units (without adjustment of weight for
non-response within groups) is given by

$$B(\bar{y}) = \frac{1}{R}\sum_{i=1}^{K} P_i (R_i - R)\,\bar{Y}_{1i} + \sum_{i=1}^{K} P_i (1 - R_i)(\bar{Y}_{1i} - \bar{Y}_{2i}) + \frac{1}{R}\sum_{i=1}^{K} P_i R_i \beta_i \qquad (1)$$


where $\bar{Y}_{1i}$ and $\bar{Y}_{2i}$ are population means of respondents and
non-respondents in the ith group, $R_i$ is the response rate for the ith group,
$P_i$ is the proportion of the total population in the ith group, $\beta_i$ is
the mean response bias in the ith group, and $R = \sum_{i=1}^{K} P_i R_i$ is the
overall response rate.


The above expression shows the decomposition of bias into three components. 
The first shows contribution of differential response rates, the second due to 
differences in characteristics between respondents and non-respondents and the 


third due to response bias. For simplicity, we consider in this paper
characteristics based on attributes, e.g., proportions of "employed" and
"unemployed". We now consider the estimate $\bar{y}_a$, with adjustment for
non-response by the inverse of the response rate done within each group. Thus

$$\bar{y}_a = \frac{1}{n}\sum_{i=1}^{K} n_i\,\bar{y}_i$$

where $n_i$ is the sample size in the ith group and $\bar{y}_i$ is the mean of the
$n_{1i}$ responding units in the ith group. The bias of $\bar{y}_a$ is given by

$$B(\bar{y}_a) = \sum_{i=1}^{K} P_i (1 - R_i)(\bar{Y}_{1i} - \bar{Y}_{2i}) + \sum_{i=1}^{K} P_i \beta_i \qquad (2)$$


The first component of bias in (1) due to differential response rates between 
groups is eliminated, the second component due to differences in character- 
istics remains the same and the third component due to response bias could be 
different from that in (1). 
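
As a numerical illustration of the three components, the following Python sketch evaluates them before and after adjustment and checks the totals against a direct computation of expected value minus true mean. The component formulas are written as they are read from the text (differential response rates, differences in characteristics, response bias); the function name and all input values are hypothetical.

```python
def bias_components(P, R, Y1, Y2, beta):
    """Return the three bias components before and after within-group adjustment.

    P: population proportions of the groups, R: group response rates,
    Y1, Y2: respondent and non-respondent means, beta: mean response biases.
    """
    R_bar = sum(p * r for p, r in zip(P, R))  # overall response rate
    # Component 1: differential response rates between groups.
    diff_rates = sum(p * (r - R_bar) * y1 for p, r, y1 in zip(P, R, Y1)) / R_bar
    # Component 2: respondents differ from non-respondents within groups.
    diff_chars = sum(p * (1 - r) * (y1 - y2) for p, r, y1, y2 in zip(P, R, Y1, Y2))
    # Component 3: response bias, weighted differently before and after.
    resp_bias_unadj = sum(p * r * b for p, r, b in zip(P, R, beta)) / R_bar
    resp_bias_adj = sum(p * b for p, b in zip(P, beta))
    return {
        "unadjusted": (diff_rates, diff_chars, resp_bias_unadj),
        "adjusted": (0.0, diff_chars, resp_bias_adj),
    }


# Hypothetical two-group example (all values are made up).
P = [0.5, 0.5]
R = [0.90, 0.98]
Y1 = [0.60, 0.62]
Y2 = [0.70, 0.72]
beta = [0.02, 0.01]
components = bias_components(P, R, Y1, Y2, beta)

# Direct check: expected value of each estimator minus the true mean.
R_bar = sum(p * r for p, r in zip(P, R))
true_mean = sum(p * (r * y1 + (1 - r) * y2) for p, r, y1, y2 in zip(P, R, Y1, Y2))
expected_unadjusted = sum(p * r * (y1 + b) for p, r, y1, b in zip(P, R, Y1, beta)) / R_bar
expected_adjusted = sum(p * (y1 + b) for p, y1, b in zip(P, Y1, beta))
```

The second component is the same tuple entry in both cases, which mirrors the statement that adjustment within groups leaves the contribution of respondent and non-respondent differences unchanged.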


Based on a framework of a response/non-response error model involving response
probabilities at unit level, the bias has been decomposed into components due
to non-response and response errors [3]. The above decomposition of bias does
not use response probabilities at the level of individual units but is simple
enough for empirical evaluation of the components.


If response rates do not differ between the groups, the first component is zero
so that (1) is identical to (2); hence non-response adjustment within the
groups does not lead to reduction in bias. The difference in the bias of $\bar{y}$
and $\bar{y}_a$ is given by

$$B(\bar{y}) - B(\bar{y}_a) = \frac{1}{R}\sum_{i=1}^{K} P_i (R_i - R)(\bar{Y}_{1i} + \beta_i) \qquad (3)$$


Thus if response rates are different, and $\bar{Y}_{1i}$ and $\beta_i$ do not differ between
the groups, there is no change in the bias after non-response adjustment
within the groups. If the means $\bar{Y}_{1i}$ and $\beta_i$ differ between the groups
there is a decrease in bias if the term on the right-hand side of (3) is
positive and an increase, if it is negative. The change in absolute bias from
$|B(\bar{y})|$ to $|B(\bar{y}_a)|$ as a result of adjustment will depend upon the sign and
magnitude of the term on the right-hand side of (3).
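
The two cases just described can be checked numerically. The sketch below evaluates the right-hand side of (3), read here as the sum over groups of P_i (R_i - R)(Y_1i + beta_i) divided by R; the function name and the input values are hypothetical and serve only to illustrate the sign behaviour.

```python
def bias_difference(P, R, Y1, beta):
    """Right-hand side of (3): bias before adjustment minus bias after."""
    R_bar = sum(p * r for p, r in zip(P, R))  # overall response rate
    return sum(p * (r - R_bar) * (y1 + b)
               for p, r, y1, b in zip(P, R, Y1, beta)) / R_bar


P = [0.5, 0.5]          # two groups of equal size (made up)
R = [0.90, 0.98]        # differing response rates (made up)

# Case 1: respondent means and response biases constant across groups.
no_change = bias_difference(P, R, Y1=[0.60, 0.60], beta=[0.01, 0.01])

# Case 2: the low-response group has the larger value of Y_1i + beta_i.
with_change = bias_difference(P, R, Y1=[0.60, 0.60], beta=[0.03, 0.01])
```

In the first case the difference vanishes up to rounding because the deviations of the response rates from their mean sum to zero; in the second it is negative, which in the terms of the text corresponds to an increase in bias after adjustment.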


The bias of estimate of mean for ith rotation group, without adjustment and
with adjustment of weight for non-response by rotation groups, is obtained
from (1) and (2) by simple substitution of $P_i = 1$ and keeping the terms
corresponding to the rotation group. Also, from (3) the difference in biases
of estimate for ith rotation group is given by

$$B(\bar{y}_i) - B(\bar{y}_{ai}) = \frac{1}{R}(R_i - R)(\bar{Y}_{1i} + \beta_i) \qquad (4)$$

where $\bar{y}_i$ and $\bar{y}_{ai}$ are estimates for ith rotation group before and after
adjustment. Assuming $\bar{Y}_{1i} + \beta_i > 0$, it follows that if $R_i < R$, the bias for ith
rotation group increases after adjustment and if $R_i > R$, it decreases.


Since the population of respondents in a survey month is the same for various
rotation groups, it may be argued that the proportions $\bar{Y}_{1i}$ could be the same
for all rotation groups or months in the survey. However, the differences in
exposure to survey or conditioning of the respondents can produce different
response biases, $\beta_i$, between rotation groups. Thus the difference in the
bias of $\bar{y}$ and $\bar{y}_a$ is given by

$$B(\bar{y}) - B(\bar{y}_a) = \frac{1}{R}\sum_{i=1}^{K} P_i (R_i - R)\,\beta_i \qquad (5)$$

However, the difference in bias of estimates for rotation group i is given 

It may also be noted that under the assumption of constant $\bar{Y}_{1i}$ and $\beta_i$ for all
i and differential response rates, non-response adjustment by rotation groups 
does not change the bias of estimate based on all rotation groups. However, 
the changes in the biases of individual rotation groups after non-response
adjustment are accounted for by different response rates.


The above results are useful in the evaluation of contribution of various 
factors to rotation group bias and the impact of adjustment of weight by 


rotation groups on the estimates of rotation group bias. 


The LFS is a monthly national household survey with a sample size of 55,000 
households. Each of the ten provinces in Canada is divided into economic 
regions, which consist of groups of counties with similar economic structure. 
The economic regions are divided into homogeneous strata on the basis of 
distribution of employed persons in various industry-occupation groups in the 


last Census. The sample design is stratified multi-stage sampling with two
stages in the self-representing (SR) urban areas and three or four stages in
the non-self-representing (NSR) rural areas of the design. The sample selection
in the initial stages is with probability proportional to population size
and that in the last stage, where dwellings are selected from clusters, is
systematic. The selected clusters are assigned six rotation numbers
independently within each stratum. In any survey month one-sixth of the
households have been in the survey for each of 1 to 6 months. Thus the entire
sample is divided into six equally representative sub-samples of equal sizes
[4]. The rotation numbers for six rotation groups can be converted to number of
"months in the survey" by a simple transformation.


The adjustment of weight for non-response is done for the entire sample in 
balancing units by ratio of households in the sample to responding households. 
In the NSR areas each primary sampling unit (PSU) is divided into two balancing
units consisting of urban and rural parts. In the SR areas of the
design, strata (called sub-units) form balancing units. The number of balancing
units thus exceeds 900 in NSR areas and 800 in SR areas.


In order to evaluate the rotation group bias in the LFS estimates, with and
without adjustment, data on non-response rates $(1 - R_i)$ and on $\bar{Y}_{1i}$ and
$\bar{Y}_{2i}$, the proportions for the characteristics "employed" and "unemployed"
for respondents and non-respondents respectively, in twelve surveys in 1981 are
presented and analyzed in Section 3. The "months in the survey" represents the
number of months (including the current month) a rotation group is in the
survey. No data on response biases, $\beta_i$, are presented.

3. ANALYSIS OF LFS DATA 


Table 1 shows average non-response rates, $(1 - R_i)$, by months in the survey
for calendar months in 1981. It can be seen that the rates differ substantially
between the two areas and between months in the survey for a given
area. In both the areas and at Canada level, non-response rates are high in
the first month, decrease substantially in the second month and decrease slowly
over the succeeding months. The high non-response rates in the first month
are contributed by "temporarily absent" and "no-one-at-home" type households.
In the later months the rates reduce due to the interviewer's knowledge about the
best time to call on these households. The rates are higher in SR areas,
especially apartments (not shown in the table), as compared to NSR areas.
During processing, for approximately 1/2% of households data are carried forward
from the previous month. The non-response rates presented in the tables are
obtained by considering those households as respondent. It may be noted that
the difference of response rates from their mean, $(R_i - R)$, is negative in the
first and in some cases in the second month in the survey and positive in the
following months. The mean rate R is approximately equal to $R_2$. Thus from (4)
relative bias for first month in the survey is expected to increase, if
$(\bar{Y}_{1i} + \beta_i)$ and population mean $\bar{Y}$ are assumed constant; for
months 3 to 6, the relative bias is expected to decrease after adjustment of
weight for non-response.
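
The reasoning in this paragraph can be traced with a short Python sketch. The non-response rates below are illustrative values of the general shape described (high in month 1, then slowly declining); they are assumptions and are not the entries of Table 1. The sketch also assumes six rotation groups of equal size, so that the mean response rate is a simple average, and a positive value of the sum of respondent mean and response bias, as in the discussion of (4).

```python
def direction_of_change(nonresponse_pct):
    """Classify each month in the survey by the sign of (R_i - R).

    Returns tuples (month, deviation, expected direction of the change in
    bias after adjustment by rotation group).
    """
    response = [1 - x / 100 for x in nonresponse_pct]   # response rates R_i
    mean_response = sum(response) / len(response)       # R for equal-sized groups
    out = []
    for month, r in enumerate(response, start=1):
        deviation = r - mean_response
        if deviation < 0:
            direction = "increase"
        elif deviation > 0:
            direction = "decrease"
        else:
            direction = "no change"
        out.append((month, round(deviation, 4), direction))
    return out


# Illustrative non-response rates in percent for months 1 to 6 (made up).
rates = [6.9, 3.8, 3.5, 3.4, 3.4, 3.0]
changes = direction_of_change(rates)
```

With these values only the first month falls below the mean response rate, and the second month lies close to it, in line with the remark that the mean rate is approximately equal to the rate for month 2.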


Table 2 shows estimated proportions, $\bar{Y}_{1i}$ and $\bar{Y}_{2i}$, of employed and
unemployed heads of households by months in the survey for respondent and
non-respondent households respectively. The estimates were obtained from LFS
longitudinal files for the period March - August 1976 and are based on
unweighted counts. The data on non-respondents, who responded at least once
during the six month period, were obtained from months in which they responded.
Non-respondent households tend to have greater proportion of employed heads
and lesser proportion of unemployed heads as compared to respondent
households. It is known that the difference of proportions between respondents and
non-respondents for employed persons tends to be 0.10 and that for unemployed
persons tends to be about 0.005, the signs of differences remaining the same.
No particular trend over months in the survey can be observed in the
proportions of employed and unemployed heads among respondent and non-respondent
households.


The contribution of the first month to the first component is negative in all 
calendar months for both unemployed and employed. This indicates that the 
bias for the first month in the survey is expected to increase after adjust- 


ment for non-response. 


The analysis in sections 2 and 3 isolates rotation groups as groups considered
for non-response adjustment. For real data, the same relative changes may not
be seen due to impact of differential response rates in other groups and
changes in magnitude of $\bar{Y}_{1i}$ and $\beta_i$ during the six month period. In
section 5, we analyze the impact of non-response adjustment by rotation groups on
rotation group bias in the LFS estimates and attempt to explain the results on
the basis of the model.


It may be noted that non-response adjustment in the present weighting of LFS 
data is done within balancing units which are much smaller than NSR and SR 
areas within a province. Thus the estimates of rotation group bias based on 
the present weighting and non-response adjustment are corrected for differential
non-response rates between the two areas but not for those between
rotation groups.


4. WEIGHT-ADJUSTMENT BY ROTATION GROUPS


The LFS final weight is composed of five factors: (1) mathematical weight, (2)
rural-urban factor, (3) cluster sub-weight, (4) balancing factor and (5) age-sex
factor. The mathematical weight for a household is the inverse of the overall
sampling ratio for the household, based on the sample design. Within each
province the weight is the same within urban (SR) and rural (NSR) strata
except in a few cases, resulting in twenty areas at Canada level with the same
mathematical weight. The cluster sub-weight is the inverse of the sampling ratio
within a cluster. The balancing factor adjusts the weight for non-response
and the age-sex factor is a ratio adjustment factor based on projected population
within age-sex groups at province level.


As explained in section 2, adjustment of weight for non-response is done within
balancing units for the sample of households. For the evaluation of the impact
of weight adjustment by rotation groups, it was decided to use progressively
smaller areas (as balancing units) starting with rotation groups at province
level. The adjustment of final weight within rotation groups in these areas
was done by multiplying by adjustment factors:

$$R_H(i) = \frac{\text{respondent households in the sample}}{\text{respondent households in rotation group } (i)}$$

$$R_P(i) = \frac{\text{respondent persons in the sample}}{\text{respondent persons in rotation group } (i)}$$

The first factor weights up the estimate of households within a rotation group
in a balancing unit to the level of the sample of respondent households. The
balancing factor weights it up to the level of the sample of households within the
balancing unit. The second factor, based on the count of respondent persons,
weights up the estimates to the level of the entire sample of respondent persons
and thus corrects the estimates for different household sizes or coverage
of persons within households. It is known that non-respondent households tend
to have smaller sizes as compared to respondent households. The difference in
non-response rates between rotation groups may result in differences in average
household sizes.
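
The two adjustment factors can be illustrated with a short sketch. It assumes a hypothetical input layout, one (rotation group, number of respondent persons) pair per respondent household in a single balancing area; the function name and the sample values are illustrative and are not taken from the LFS system.

```python
from collections import Counter

def rotation_adjustment_factors(records):
    """Return household-based and person-based factors per rotation group.

    records: iterable of (rotation_group, persons_in_household) pairs,
    one per respondent household (hypothetical layout).
    """
    households = Counter()
    persons = Counter()
    for group, n_persons in records:
        households[group] += 1
        persons[group] += n_persons
    total_households = sum(households.values())
    total_persons = sum(persons.values())
    # Sample count divided by the count in the rotation group.
    r_h = {g: total_households / households[g] for g in households}
    r_p = {g: total_persons / persons[g] for g in persons}
    return r_h, r_p

# Illustrative data: three rotation groups, six respondent households.
sample = [(1, 2), (1, 3), (2, 4), (2, 2), (2, 1), (3, 5)]
r_h, r_p = rotation_adjustment_factors(sample)
```

When household sizes differ between rotation groups the two factors diverge, which is the situation the person-based factor is meant to correct.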


If $\hat{Y}(i)$ is the estimated total of the ith rotation group and $Y(i)$ the true value of
the ith group total, then the estimate of relative bias of the estimated total of the
ith rotation group is given by

$$\frac{\hat{Y}(i) - Y(i)}{Y(i)}.$$

Since the $Y(i)$'s are not known and can be assumed to be approximately equal (since
rotation groups have equal expected sizes at large area level), $\hat{Y}(.)$, the mean
of the six rotation group total estimates, can be used in place of $Y(i)$. The
rotation group bias index for the ith rotation group is given by

$$I_Y(i) = 100 \, \frac{\hat{Y}(i)}{\hat{Y}(.)} \qquad (7)$$


It may be noted that, since the mean of estimates of six rotation group totals
is used instead of true values, $I_Y(i)$ may be biased but is useful as a measure
for evaluation of difference in relative biases between rotation groups
for various sub-groups of the population and adjustment of weight based on
household and person counts. Similarly, $I_P(i)$, the rotation group bias index of
the population estimate, can be defined for individual rotation groups. The values
of the index $I_Y(i)$ above 100 indicate positive relative bias and the values
below 100 indicate negative relative bias. The index $I_P(i)$ can be interpreted
similarly.
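
As a small illustration, the index is 100 times each rotation group estimate divided by the mean of the six estimates. The totals below are made-up numbers, not LFS results.

```python
def rotation_group_bias_index(group_totals):
    """Index = 100 * (group estimate) / (mean of all group estimates)."""
    mean_total = sum(group_totals) / len(group_totals)
    return [100.0 * total / mean_total for total in group_totals]

# Hypothetical estimated totals for rotation groups 1 to 6.
totals = [980.0, 1010.0, 1005.0, 1000.0, 1015.0, 990.0]
index = rotation_group_bias_index(totals)
# Values below 100 (groups 1 and 6 here) indicate negative relative bias.
```

By construction the six index values average to 100, so the index measures differences between rotation groups rather than absolute bias.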


5. ANALYSIS OF DATA ON ROTATION GROUP BIAS INDEX


In the following tables data on rotation group bias index for population and 
labour force status categories by type of area and age-sex groups are present- 
ed and analyzed. The index values are obtained by using final weights and the 
same adjusted for non-response by rotation groups using each of the two 
factors based on household and person counts. A comparison of index values 
based on adjusted and unadjusted weights is used in evaluation of impact of 
weight adjustment on estimates of rotation group bias. The adjustment of 
weight by rotation groups, using household counts, was done at province level. 
Thus the final weights for households in the six rotation groups in each province
were multiplied by adjustment factors $R_H(i)$; i = 1,2,...,6. Similarly, the
adjustment based on count of persons was done at province level by factors
$R_P(i)$; i = 1,2,...,6. In order to evaluate the impact of these adjustments
on estimates of population we present Table 3 showing rotation group bias 
index for population estimates by type of area and months in the survey for 
twelve surveys in 1981. The index values based on unadjusted weight indicate 
that there is relative underestimation of persons in the first and the sixth 
month in both SR and NSR areas. The index values based on weight adjustment 
using household counts show some improvement in bias; however, this adjustment 
assumes that household size is the same in six rotation groups. The index 
values based on weight adjustment using counts of respondents are closer to 
100.0 in both the areas, as compared to those based on household adjustment. 
Thus, the adjustment based on count of persons seems to correct the estimates 
for differential bias better than the adjustment based on household counts. 
The higher index values in earlier months and lower in later months could be
due to changes in size of non-responding households by month in the survey.


Tables 4 and 5 present data on average index values by type of area and 
age-sex groups for twelve surveys in 1981. Index values by type of area based 
on unadjusted weight indicate that relative bias of estimates of unemployed 
tends to be positive in the first two months and shows a decreasing trend in 
the later months. Those for employed and in labour force tend to be negative 
in the first month and positive in the following months. Data on index values
by age-sex groups show similar trends as those by type of area.

The adjustment of weight for non-response based on household counts tends to
increase the index values in the first month and also fifth and sixth months. 
The index values in other months tend to decrease. This is true for index 
values for labour force status by type of area and age-sex groups. The 
increase in index values in the first month can be attributed to lower than 
average response rates and the decrease in index values in the following 
months to higher than average response rates. The decrease in the last two
months cannot be explained on the basis of higher than average response rates
if $(\bar{Y}_{1i} + B_i)$ is assumed constant.

The adjustment of weight for non-response based on count of persons tends to 
increase the index values in the first month and decrease the index values in 
the third to sixth month. The index values for the first month based on 
adjustment using count of persons tend to be greater than those based on 
household adjustment. The adjustment based on count of persons seems to 
correct the estimates for differential response between rotation groups. The 
response rates are low in the first month resulting in increase in relative 
bias after adjustment. The decrease in the relative bias in the third to
sixth month seems to be due to lower than average response rates at household
level, corrected for differential household size between rotation groups.


6. SUMMARY AND CONCLUDING REMARKS


This paper considers a model which decomposes overall bias into three compo- 
nents, showing the contribution due to differences in response rates, response 
biases and characteristics of respondents and non-respondents between groups 
of a sample. Rotation groups can be considered as a particular case of these 
groups in which adjustment of weight for non-response can be done separately. 


The model also shows contribution of various factors to rotation group bias. 


If response rates differ between rotation groups, and the proportion of a 
characteristic for respondents and the associated response bias is equal for 
all rotation groups, non-response adjustment by rotation groups does not 
change the bias of estimates. However, rotation group bias can increase or 
decrease, according as response rate is lesser or greater than the mean 
response rate. This is corroborated by data on index values before and after
adjustment of weight, based on count of persons.

It is proposed to analyze index values for labour force status and other
characteristics for larger data sets and to study the impact of differences in
average household sizes between rotation groups and respondent and
non-respondent households on estimates of rotation group bias. The contribution
of differential response rates and response biases to rotation group bias,
after adjustment for non-response by rotation groups, will also be analyzed.


TABLE 1. % Non-Response Rates for Households by Months in Survey and Type of Area (1981)


TABLE 2. Estimated Proportions of Employed and Unemployed Heads in Respondent and
Non-Respondent Households

Months      Respondents            Non-Respondents        Difference
in survey   Employed  Unemployed   Employed  Unemployed   Employed  Unemployed

1           0.6893    0.0383       0.7839    0.0335       -0.0946   0.0048
2           0.6962    0.0344       0.7841    0.0321       -0.0879   0.0023
3           0.7006    0.0311       0.7851    0.0300       -0.0845   0.0011
4           0.7006    0.0364       0.7877    0.0281       -0.0871   0.0083
5           0.6972    0.0317       0.7821    0.0317       -0.0849   0.0000
6           0.6927    0.0331       0.7767    0.0320       -0.0840   0.0011

Mean        0.6961    0.0342       0.7833    0.0311       -0.0872   0.0031


TABLE 3. Rotation Group Bias Index


TABLE 5. Rotation Group Bias Index by Age-Sex Groups (1981)


COMPUTERIZATION OF COMPLEX SURVEY ESTIMATES 1


Survey data collected by statistical agencies is most likely to
be processed through to the tabulation stage by these agencies.
The computer programs associated with this processing are also
most likely tailored to the particular design and variables used.
The statistics computed from such surveys typically range from
simple descriptive totals and means to those required for
analytic studies such as comparison of domains, regression 
analysis and contingency tables analysis. This paper describes a 
computer program which computes these statistics and their 
associated sampling errors for commonly used sampling designs. 


1. INTRODUCTION 


A variety of statistics are computed for survey data which often arise from 
large, complex national and regional surveys. The statistics computed from 
such surveys typically range from simple descriptive totals and means to those 
required for analytic studies such as comparison of domains, regression 
analysis, and contingency tables analysis. Domain estimation refers to the 
estimation of statistics for subgroups of the population of interest which are 
not explicitly provided for in the design. Yates (1960) contains considerable 
material on the estimation of domain means and their differences. Hartley 
(1959) and Rao (1975) provide an excellent account of the methodology used for 
domain estimation. The variance estimators associated with the domain 
estimators are easy extensions of variance estimators for simple statistics. 
This is not, however, the case for more complex statistics. The estimation of 
regression equations from survey data presents several problems; for example, 
the definition of the regression equations, the identification of the 
population for which inferences are desired, and the variance estimation for
the regression coefficients (see Konijn (1962), Kish and Frankel (1974) and
Fuller (1975)). The testing of hypotheses for contingency tables given survey
design considerations has been studied by Nathan (1969, 1972), Rao and Scott
(1981), Garza-Hernandez and McCarthy (1962) and Koch, Freeman and Freeman
(1975), to name a few.


1 Presented at the Annual Meetings of the American Statistical Association,
Detroit, August 1984.
M.A. Hidiroglou, Business Survey Methods Division, Statistics Canada.


Survey data collected by statistical agencies is most likely processed through 
to the tabulation stage by these agencies. The computer programs associated 
with this processing are also most likely tailored to the particular design 
used. It is quite possible that computer programs used to produce estimates 
of totals (say) and their associated variances must be developed from scratch 
every time that a new survey design is introduced. This is time consuming, 
expensive, tedious and in some sense repetitive. Use of statistical software 
packages such as SPSS or SAS may be considered as an alternative. These
packages may be readily used to produce weighted estimates. However, the
variances that they compute do not take sample design factors such as
stratification and clustering into account unless they are programmed to do
so. A user must therefore be fairly familiar with the language used by these
packages if he wants to obtain proper variance estimates for survey estimates.


Recently, there have been attempts to develop programs which compute variances 
for a general class of designs. Some of these programs are STDERR by Shah 
(1974), SURREGR by Holt (1975), SUPER CARP and MINI CARP by Hidiroglou, Fuller 
and Hickman (1980). These programs basically require the specification of the 
estimator to be used and the variables to be analysed. It will be assumed 
that the data sets that these programs are being applied to have been edited 
and that missing observations have been imputed. In this paper, SUPER CARP 
and MINI CARP will be described. SUPER CARP can be used to construct 
estimated totals, ratio estimates, the difference of ratio estimates and 
contingency tables tests for multistage stratified samples. It contains a 
number of regression procedures appropriate for data observed subject to
response (measurement) error. Covariance matrices can be estimated for
subpopulation means and totals and for stratum means and totals. MINI CARP is a
smaller program which differs from SUPER CARP in that it does not contain any
of SUPER CARP's regression procedures. A comparison of the capabilities of
the two programs is given in Table 1.


TABLE 1. Capabilities of SUPER CARP (S) and MINI CARP (M)


                              Multivariate Estimate For
                              Entire        Individual    Sub-
                              Population    Strata        population

Simple Parameters

. Means                       S,M           S,M           S,M
. Totals                      S,M           S,M           S,M
. Ratios                      S,M           S,M           S,M
. Difference of Ratios        S,M           S,M           S,M
. Proportions                 S,M           S,M           S,M


Complex Parameters

. Weighted Least Squares                                  S
. Weighted Errors-in-the-Variables
  (Known & Estimated error covariances)                   S

Tests

. Regression Coefficient                                  S
. Goodness-of-fit                                         S,M
. Independence for Two-Way Table                          S,M


2. GENERAL DESCRIPTION 


2.1 Notation 


In general, SUPER CARP and MINI CARP can accept data from a multistage
stratified design. Assuming that the design has s stages, a g dimensional
data vector is read in for each observation. We denote this data vector as

$$(Z_{hi.1}, Z_{hi.2}, \ldots, Z_{hi.g})$$

where $hi. = h\,i_1 i_2 \ldots i_s$; h denotes the stratum (h = 1, 2, ..., L);
$i_1 = 1, 2, \ldots, n_h$ represents the first stage identification;
$i_2 = 1, 2, \ldots, n_{hi_1}$ represents the second stage identification; ...;
$i_s = 1, 2, \ldots, n_{hi_1 \ldots i_{s-1}}$ represents the last (s-th) stage identification.
$Z_{hi.k}$ is the hi.-th observation for the k-th variable of interest. Weights
associated with the hi.-th observation will be referred to as $W_{hi.}$. These
weights would be inversely proportional to the selection probabilities of each
ultimate sampled unit. The specification of the variables to be used in the
analysis (be it total or ratio estimation or regression estimation) is done by
using a selection vector $v = (v_1, v_2, \ldots, v_{p+1})$ where $1 \le v_k \le g$ for
k = 1, 2, ..., p + 1. Given that the type of analysis and the identification of the
variables has been decided upon, let the chosen vector for the hi.-th observation
be

$$(Y_{hi.}, X_{hi.1}, X_{hi.2}, \ldots, X_{hi.p}),$$

where Y denotes the dependent variable and X denotes the independent variables
if regression analysis is specified. Note that $v_1$ is always the index for the
dependent variable in the case of regression. For other types of analyses,
the ordering within the selection vector is not important.


2.2 Types of Computations 


The simple statistics and a partial list of the regression options available
in the program are outlined. A complete description of all the available
options is written up in the SUPER CARP or MINI CARP manuals (1980).


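(i) Total Estimator


The estimated covariance matrix for
$\hat{X} = (\hat{X}_{(1)}, \hat{X}_{(2)}, \ldots, \hat{X}_{(p)})$ is

$$v(\hat{X}) = \sum_{h=1}^{L} \frac{n_h}{n_h - 1} (1 - f_h) \sum_{i_1=1}^{n_h} (x_{hi_1.} - \bar{x}_{h..})(x_{hi_1.} - \bar{x}_{h..})' \qquad (2.2.1)$$

where $x_{hi_1.}$ is the vector of weighted totals for the $i_1$-th first stage unit
of stratum h, with k-th element

$$x_{hi_1.(k)} = \sum_{i_2} \cdots \sum_{i_s} W_{hi.} X_{hi.(k)},$$

$$\bar{x}_{h..} = n_h^{-1} \sum_{i_1=1}^{n_h} x_{hi_1.},$$

and $f_h$ is the correction factor (sampling rate) for stratum h.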
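
A minimal sketch of this kind of computation for a single variable follows. It assumes the usual between-cluster form: weighted totals are formed for each first stage unit, and their within-stratum variability is summed over strata with an optional correction factor. The row layout, function name and data are hypothetical and this is not the SUPER CARP code.

```python
from collections import defaultdict

def total_and_variance(rows, sampling_rates=None):
    """Estimated total and its first-stage variance estimate.

    rows: iterable of (stratum, cluster, weight, value) tuples.
    sampling_rates: optional dict stratum -> f_h (0 if with replacement).
    """
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, weight, value in rows:
        cluster_totals[stratum][cluster] += weight * value
    rates = sampling_rates or {}
    total = 0.0
    variance = 0.0
    for stratum, clusters in cluster_totals.items():
        x = list(clusters.values())
        n_h = len(x)
        total += sum(x)
        if n_h < 2:
            continue  # variability cannot be estimated from one cluster
        mean = sum(x) / n_h
        ss = sum((xi - mean) ** 2 for xi in x)
        variance += (1.0 - rates.get(stratum, 0.0)) * n_h / (n_h - 1) * ss
    return total, variance

rows = [
    ("A", 1, 10.0, 2.0), ("A", 1, 10.0, 3.0), ("A", 2, 10.0, 4.0),
    ("B", 1, 5.0, 1.0), ("B", 2, 5.0, 3.0),
]
total, variance = total_and_variance(rows)
total_fpc, variance_fpc = total_and_variance(rows, {"A": 0.5})
```

Strata with a single cluster still contribute to the total but are skipped in the variance sum, since no within-stratum variability can be estimated from them.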


Note that the above variance formula may be applied to pps schemes with and
without replacement. For with replacement schemes, only the first stage variance
needs to be computed (Des Raj, 1968, pg. 120) and the correction factors
$f_h$ are set to zero. In large scale surveys, it is often assumed that the
first stage clusters have been selected with replacement even though the
actual selection scheme may have been without replacement. This assumption
in conjunction with small sampling fractions implies that the resulting variance is
fairly close to the one which would have been obtained by taking all stages
and selection procedures into account. If the sampling fractions are not
negligible at each stage and the sampling has been performed using without
replacement S.R.S. at each stage, Des Raj's rule (1966) can be used to
advantage to compute each stage component of covariance. The covariance matrix
accounting for s stages is:


where for r > 2 


EHO] <- 

The variance estimation for an r-stage design can therefore be done by estimating
the components at each stage, $v_r(\hat{X})$, and summing them up. This can be
done by passing over the data set r times. The first time around,
strata and first stage units are read into the program to give $v_1(\hat{X})$. The
second time around, the original primary sampling units are read into the program
as "strata" and the secondary units are identified as clusters to give
$v_2(\hat{X})$. The r-th time around (r ≥ 2), the original (r-1)-th stage units are
read into the program as "strata" and the r-th stage units are identified as
clusters to give $v_r(\hat{X})$.


On each pass a sampling rate Shi must be read in for the hie -th unit 


] 
-r-] 
where 4 


r=2°"  hik hi 

Shines Neees ees 
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4 old 


Using this procedure, the program will be computing $v_r(\hat{X})$ in the format given
by $v(\hat{X})$.

If the sampling fractions are not negligible at each stage and sampling has
been performed using without replacement p.p.s. schemes at each stage, the
variance expression at each stage must take into account joint selection
probabilities. SUPER CARP and MINI CARP do not compute joint selection
probabilities. For the case where two units per stratum have been selected without
replacement and with unequal probability, the variance of the estimator for the total
can be obtained using formula (2.2.1) with a correction factor for each stratum
which includes the joint probabilities of selection. This correction factor
is given by

$$f_h = 1 - \frac{\pi_{h1}\pi_{h2} - \pi_{h12}}{\pi_{h12}}, \qquad h = 1, \ldots, L,$$

where $\pi_{h1}$ and $\pi_{h2}$ are the selection probabilities of the selected units 1
and 2 in stratum h and $\pi_{h12}$ is their joint probability of selection. If
$n_h > 2$ and the joint probabilities of selection are not available,
an approximation to the without replacement variance has been given by Gray
(1975). Gray shows that the variances of an unequal without replacement sample
may be partitioned into a "with replacement" variance component times a
finite population correction factor which depends on the joint probabilities.
This correction factor has been found to be roughly equal to one minus the
inverse of the sampling fraction for populations which have more than 15
elements within each stage. Using Gray's approximation, variances for multistage
unequal without replacement schemes can be computed.


If domain estimation is required for some of the variables, a new variable
${}_dY_{hi.(k)}$ is defined for all elements in the population, where

$${}_dY_{hi.(k)} = \begin{cases} Y_{hi.(k)} & \text{if the hi.-th element belongs to the domain d (say } D_d\text{)} \\ 0 & \text{otherwise.} \end{cases}$$

An alternative way of defining ${}_dY_{hi.(k)}$ is

$${}_dY_{hi.(k)} = {}_d\delta_{hi.} \, Y_{hi.(k)}, \quad \text{where} \quad {}_d\delta_{hi.} = \begin{cases} 1 & \text{if the hi.-th element belongs to } D_d \\ 0 & \text{otherwise.} \end{cases}$$

Note that if $\hat{Y}$ and $v(\hat{Y})$ are unbiased for $Y$ and $V(\hat{Y})$ respectively, then the
corresponding domain estimators ${}_d\hat{Y}$ and $v({}_d\hat{Y})$ are unbiased for ${}_dY$ and
$V({}_d\hat{Y})$. The standard formulae for $\hat{Y}$ and $v(\hat{Y})$ can now be applied to the
"synthetic" variables ${}_dY_{hi.(k)}$. Stratum totals can be computed individually by
treating the strata as classification variables.
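
The synthetic-variable device can be sketched as follows. Records outside the domain are kept with their value set to zero rather than being dropped, so that every cluster remains available when the usual total and variance formulae are applied. The record layout (dicts with hypothetical keys `weight`, `y` and `sex`) is an assumption made only for illustration.

```python
def synthetic_domain_records(records, in_domain):
    """Copy each record, zeroing y for elements outside the domain."""
    return [dict(r, y=r["y"] if in_domain(r) else 0.0) for r in records]

def weighted_total(records):
    """Weighted total of y over all records."""
    return sum(r["weight"] * r["y"] for r in records)

records = [
    {"weight": 10.0, "y": 2.0, "sex": "F"},
    {"weight": 10.0, "y": 3.0, "sex": "M"},
    {"weight": 5.0, "y": 4.0, "sex": "F"},
]
female = synthetic_domain_records(records, lambda r: r["sex"] == "F")
domain_total = weighted_total(female)
```

Because the synthetic data set has the same number of records and the same cluster structure as the original, it can be passed unchanged to any routine written for population totals.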


(ii) Ratio Estimator


The vector $(Y_{hi.(1)}, X_{hi.(1)}, \ldots, Y_{hi.(p)}, X_{hi.(p)})$ is used in the analysis
and the estimated ratios are:

$$\hat{R}_{(t)} = \hat{Y}_{(t)} / \hat{X}_{(t)}, \qquad t = 1, 2, \ldots, p,$$

where $\hat{Y}_{(t)}$ and $\hat{X}_{(t)}$ are of the form given in the previous section. The
estimated covariance matrix for $\hat{R} = (\hat{R}_{(1)}, \hat{R}_{(2)}, \ldots, \hat{R}_{(p)})$ is as given in the
previous section with the weighted first stage unit totals replaced by

$$d_{hi_1.(t)} = \hat{X}_{(t)}^{-1} \sum W_{hi.} \left( Y_{hi.(t)} - \hat{R}_{(t)} X_{hi.(t)} \right),$$

the sum being over all observations in the first stage unit.

The ratio estimator can be used for computing the mean for each variable of
interest by setting all X-variables to 1. Domain means can be computed by
using ${}_dY_{hi.(t)}$ in the place of $Y_{hi.(t)}$ and ${}_d\delta_{hi.}$ in the place of $X_{hi.(t)}$. If
subpopulation proportions of Y for a domain $D_d$ are required, the numerator of
the ratio is the sum of weighted ${}_dY_{hi.(t)}$ and the denominator is the sum of
weighted ${}_d\delta_{hi.}$. The estimated ratio for two variables defined over a domain
$D_d$ may similarly be obtained. Stratum proportions and ratios may be computed
with the strata serving as the classification variables.

(iii) Regression Estimation 


Some considerable attention has been paid recently to regression concepts in
survey sampling. There are several explanations for this. First, there is an
increased emphasis on analytic surveys, with partly unresolved questions of
proper weighting of observations. Secondly, modeling in general, especially
in the regression context, has attracted widespread interest, as well as
criticism, as a tool in making survey estimates. SUPER CARP properly weights the
observations and computes the variances of the estimated regression
coefficients using a method given by Fuller (1975).


The regression coefficients estimated from a stratified cluster sample are 


given by 

The variance estimation procedure is based on an asymptotic Taylor expansion
of the sample regression coefficient vector. This method has several advantages
over the Balanced Repeated Replication and Jack-Knife Replication
methods. Firstly, it is relatively easy to program, and it can be adapted to
multistage sample designs. Secondly, no restrictions are placed on the sample
design (two replicates per stratum, for instance) and the assumptions used
require some well-behaved moments in the population of interest. Thirdly, it
requires the least number of computations.


Data is quite frequently measured with error. Theory for regression models 
which takes measurement error into account has been given by Fuller (1980a), 
Fuller (1980b) and Fuller and Hidiroglou (1978). SUPER CARP also has the 
flexibility to compute tests of hypothesis for any subsets of the regression 


parameters. 


(iv) Contingency Tables 


SUPER CARP and MINI CARP perform the goodness-of-fit test and the independence
test for data resulting from complex surveys. These two tests take the
stratification and the clustering of the design into account. As pointed out by
Rao and Scott (1981), practitioners using traditional Pearson chi-square
statistics for those two tests can be seriously misled when there are serious
design effects.

For the goodness-of-fit test, SUPER CARP and MINI CARP use the modified Wald
statistic given by

$$F = [(k-1)d]^{-1} (d-k+2) \, (\hat{p} - p_0)' \, \hat{V}^{-1} (\hat{p} - p_0)$$

where

$\hat{p} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_{k-1})'$ is the vector of estimated proportions given
the stratum and cluster configurations,

$\hat{V}$ = the covariance matrix of $\hat{p}$ given the stratum and cluster configuration,

$p_0 = (p_{01}, p_{02}, \ldots, p_{0,k-1})'$ is the vector of hypothesized proportions,

k = number of categories considered,

$d = \sum_{h=1}^{L} (n_h - 1)$,

L is the number of strata in the sample and $n_h$ is the number of clusters in
the h-th stratum. The covariance matrix $\hat{V}$ is computed using the methods given
for ratio estimation. In large samples, F is approximately distributed as a
central F with k-1 and d-k+2 degrees of freedom when the null hypothesis is true.
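
A small numerical sketch of the statistic is given below. It assumes that the estimated proportions and their design-based covariance matrix have already been computed for the first k-1 categories; the linear solver, function names and input values are illustrative only.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x

def modified_wald_f(p_hat, p_null, cov, k, d):
    """Return the F statistic and its (numerator, denominator) d.f."""
    diff = [a - b for a, b in zip(p_hat, p_null)]
    quad = sum(di * zi for di, zi in zip(diff, solve(cov, diff)))
    f_stat = (d - k + 2) / ((k - 1) * d) * quad
    return f_stat, (k - 1, d - k + 2)

# Hypothetical inputs: k = 3 categories, d = 10 (clusters minus strata).
f_stat, dof = modified_wald_f(
    p_hat=[0.5, 0.3], p_null=[0.4, 0.3],
    cov=[[0.004, 0.0], [0.0, 0.002]], k=3, d=10,
)
```

The factor (d-k+2)/((k-1)d) shrinks the Wald quadratic form when the number of clusters is small relative to the number of categories.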


For the test of independence, Fuller (SUPER CARP pp. 65-69) has developed a
test which takes the design into account. Given that the contingency table
which splits the population according to two criteria is made up of R rows and
C columns, the null hypothesis to be tested is $H_0: p_{ij} = p_{i.} \, p_{.j}$, or
$p_{j|i} = p_{.j}$, where $p_{ij}$ = ij-th cell proportion in the population,
$p_{i.} = \sum_j p_{ij}$ and $p_{.j} = \sum_i p_{ij}$.

Given that $p_{j|i}$ is defined as $p_{i.}^{-1} p_{ij}$ and that the corresponding sample
estimators are $\hat{p}_{j|i} = \hat{p}_{i.}^{-1} \hat{p}_{ij}$, estimates for $(p_{.1}, p_{.2}, \ldots, p_{.C-1})$ can be
obtained by regressing $\hat{p}_{j|i}$ (i = 1, 2, ..., R; j = 1, 2, ..., C-1) on (C-1)
dimensional row vectors whose elements are one for the j-th entry corresponding
to $\hat{p}_{j|i}$ and zero otherwise. The regression is of a generalized least-squares
nature because the $\hat{p}_{j|i}$ do not have the same error structure. An
estimator for the covariance matrix of the $\hat{p}_{j|i}$, incorporating the sample
design, is obtained using the ratio estimator formulae. The test statistic
for $H_0$ is then based on the residual sums of squares for this regression.


3. INPUT 


In a typical survey situation, the data associated with a given selected
unit is characterized by stratum, first stage, second stage up to s-th stage
identification and a sampling weight. The data must be ordered hierarchically
with respect to this identification in order to produce estimates of variance
which reflect the stratified and clustered nature of the data.


SUPER CARP and MINI CARP are run using command language specified in
numeric codes in fixed card positions. For both programs, there are six
mandatory control cards to be input at all times. A number of optional control
cards may also be input if more information is required by options specified
in the mandatory cards. The mandatory cards are the parameter card, the
variable name card, the format card, the screening card, the analysis card and the
variable identification card. The parameter card provides overall preliminary
information to the program such as problem identification, number of
observations to be read in, input service identification (tape, disk or cards), data
identification structure, data output and stratum collapsing controls. The
format card specifies the input format for the data as well as its
identification and the associated weight. The variable name card assigns chosen names
to input data fields in the order that they are read in. The screening card
specifies tolerance limits for given variables provided that screening is
required. The analysis card specifies the type of analysis to be performed
(see Table 1). Finally, the variable identification card identifies the
variables to be used in the chosen analyses. The optional cards include such
cards as the sampling rate card (sampling rates by stratum can be read in),
the errors-in-the-variables cards for supplying the program with covariance
matrices for variables measured with error, and the hypothesis testing card for
specifying coefficients in a regression analysis to be tested equal to zero.


4. COMPUTATIONS 


4.1 For Means and Corrected Sums of Squares and Cross-Products 


The means, corrected sums of squares and cross-products are statistics
routinely computed in a survey package. The choice of algorithms for computing
these statistics should take into consideration precision, speed and
storage requirements. Beaton, Rubin and Barone (1976) have noted that a
"concern about highly accurate computation methods must be tempered with a
concern for whether the data are accurate enough to make the results
meaningful". Different variations of one-pass and two-pass algorithms have been
studied by Ling (1974). Ling's conclusion is that there is no universally
best algorithm. The best algorithm for a given data set depends on the
numbers in that data set. One of his recommendations is to use double precision
arithmetic to go beyond the accuracy attainable in single-precision
arithmetic. One-pass recursive algorithms should be chosen over the usual
one-pass 'desk-machine' method because they tend to produce
smaller computational errors. This is especially the case for subroutines
programmed in single precision. In SUPER CARP and MINI CARP one-pass recursive
algorithms programmed in double precision have been chosen.
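
One common one-pass recursive scheme is the updating (Welford-type) recursion sketched below, extended to weighted data. It is offered as an illustration of the class of algorithms described, under the assumption that a weighted mean and corrected sum of squares are wanted; it is not claimed to be the exact recursion used in the programs.

```python
def one_pass_mean_css(values, weights=None):
    """Weighted mean and corrected sum of squares in a single pass."""
    total_weight = 0.0
    mean = 0.0
    css = 0.0
    for i, x in enumerate(values):
        w = 1.0 if weights is None else weights[i]
        if w == 0.0:
            continue  # zero-weight observations contribute nothing
        total_weight += w
        delta = x - mean
        mean += (w / total_weight) * delta
        css += w * delta * (x - mean)  # uses the old and the updated mean
    return mean, css

# Large offsets are where the 'desk-machine' formula
# sum(x**2) - sum(x)**2 / n loses accuracy through cancellation.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
mean, css = one_pass_mean_css(data)
```

Python floats are double precision, so this corresponds to the double precision choice described above.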
4.2 Inversion of Matrices 


Matrix inversion is required for regression and contingency table analysis.
The choice of inversion algorithm is quite important in packages. This was
shown by Longley (1967), who examined the accuracy of some inversion
algorithms and found serious computational inaccuracies. He reported that the
most accurate results were obtained by using the orthonormalization procedure.
Kopitze, Boardman and Graybill (1975) recommend the use of the Cholesky
decomposition as an inverting algorithm. They point out that, compared to
Gaussian elimination schemes, it does not require pivoting to remain stable for
symmetric positive definite matrices. This means less time for inverting. The
Cholesky decomposition does not need much core storage and is easier to program
than the Gaussian elimination scheme. Another advantage, as Wilkinson's (1965)
analysis shows, is that it is quite accurate. A further advantage is that it
can be used to find eigenvalues for systems of equations of the form
Ax = λBx, where A is a positive matrix and B is a positive semi-definite
matrix. Computations of eigenvalues are required in SUPER CARP for some of the
errors-in-variables regression analyses. It is for this reason, and because of
the precision considerations, that the Cholesky decomposition has been adopted
for inversion purposes in SUPER CARP.
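
A minimal sketch of inversion through the Cholesky decomposition is given below. It is an illustration in Python rather than the FORTRAN routine used in SUPER CARP, and the function names `cholesky` and `invert_spd` are hypothetical. No pivoting is performed, which is the property noted above for symmetric positive definite matrices.

```python
import math


def cholesky(a):
    """Return lower-triangular L with A = L L^T for a symmetric positive definite A."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def invert_spd(a):
    """Invert A column by column: solve L y = e_k, then L^T x = y."""
    n = len(a)
    low = cholesky(a)
    inv = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Forward substitution with L.
        y = [0.0] * n
        for i in range(n):
            rhs = 1.0 if i == k else 0.0
            y[i] = (rhs - sum(low[i][m] * y[m] for m in range(i))) / low[i][i]
        # Back substitution with L^T.
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(low[m][i] * x[m] for m in range(i + 1, n))
            x[i] = (y[i] - s) / low[i][i]
        for i in range(n):
            inv[i][k] = x[i]
    return inv
```

For the 2 x 2 matrix with rows (4, 2) and (2, 3), `invert_spd` returns, up to rounding, the rows (0.375, -0.25) and (-0.25, 0.5).
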


4.3 Stratum Collapsing 


If a sampled population is highly heterogeneous and several criteria are
available for stratification, it is quite possible that some strata may contain
only one cluster. For such strata, it is not possible to estimate the
variability. In such cases, the user may request that the one-cluster strata be
collapsed with neighbouring strata. If such a request is not made, SUPER CARP
or MINI CARP excludes strata with one cluster from variance computations but
includes them for estimation purposes. The program lists those strata with only
one unit. This information may lead the user to collapse those strata in a
subsequent pass. If collapsing is to be done, the strata which are to be
collapsed should be similar to neighbouring strata. A suggested method for
collapsing which is easily amenable to programming is as follows. If a stratum
is encountered that contains only one cluster, that stratum is combined
with the following stratum in the file sequence. If the last stratum contains
only one cluster, the last stratum is combined with the next to last stratum.
A stratum with a sampling rate of one is not collapsed because such a stratum 
makes no contribution to the between primary component of the sampling vari- 
ance. Strata with a sampling rate of one should never appear after a stratum 
with only one cluster. One way to ensure this condition is to place all 
observations with a sampling rate of one at the beginning of the file 
sequence. If two strata are collapsed, the resulting sampling rate for the
new stratum is computed as a function of the old sampling rates and the number
of elements in the previous strata.
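
The collapsing rule described in this section can be sketched as follows. This is an illustration only, not the SUPER CARP routine: the `Stratum` record and function names are hypothetical, and because the text does not give the function used for the new sampling rate, the sketch assumes it is the combined number of sampled clusters divided by the combined implied population count.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    ids: list          # identifiers of the original strata merged into this one
    n_clusters: int    # number of sampled clusters
    rate: float        # sampling rate, 0 < rate <= 1


def combine(a, b):
    """Merge two strata; the combined rate is an ASSUMED formula (see lead-in)."""
    n = a.n_clusters + b.n_clusters
    implied_population = a.n_clusters / a.rate + b.n_clusters / b.rate
    return Stratum(a.ids + b.ids, n, n / implied_population)


def collapse(strata):
    """Collapse one-cluster strata in file order."""
    out = []
    pending = None  # a one-cluster stratum waiting for the following stratum
    for s in strata:
        if pending is not None:
            if s.rate >= 1.0:
                raise ValueError("a rate-one stratum follows a one-cluster stratum")
            s = combine(pending, s)
            pending = None
        # Strata with a sampling rate of one are never collapsed.
        if s.n_clusters == 1 and s.rate < 1.0:
            pending = s
        else:
            out.append(s)
    if pending is not None:
        # The last stratum has one cluster: merge with the next to last.
        if out and out[-1].rate < 1.0:
            out[-1] = combine(out[-1], pending)
        else:
            out.append(pending)
    return out
```

Placing all rate-one strata at the beginning of the file, as recommended above, guarantees that the `ValueError` branch is never reached.
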
4.4 Clusters of Size One


If there are clusters of size one within a stratum at the first stage,
collapsing of adjoining strata ensures that variance estimates will be
computed. For a multi-stage design, some of the stages may contain
single-element clusters. For those clusters, no within-cluster variation can
be computed. There are several ways of handling this situation. One is to
assume a zero variance contribution from those single-element clusters.
Another is to collapse them with neighbouring clusters. An alternative is to
assume that they contribute a variance equal to the overall within variation of
the clusters for which the within variation can be computed. The variance
contribution for those stages where some of the clusters are of size one would
incorporate this approximation.
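
Two of the options above, the zero contribution and the overall within variation, can be sketched as follows. The function name and the `rule` argument are hypothetical, and the "overall within variation" is assumed here to be the pooled within-cluster variance (total within sum of squares divided by total degrees of freedom); the text does not specify the exact form.

```python
def within_variances(clusters, rule="pooled"):
    """Return one within-cluster variance per cluster.

    clusters: list of lists of element values.
    rule: "zero"   -> single-element clusters contribute zero variance;
          "pooled" -> they receive the pooled within variance of the
                      clusters that have two or more elements (assumed form).
    """
    ss_total = 0.0
    df_total = 0
    variances = []
    for values in clusters:
        m = len(values)
        if m < 2:
            variances.append(None)  # cannot be computed for this cluster
            continue
        mean = sum(values) / m
        ss = sum((v - mean) ** 2 for v in values)
        ss_total += ss
        df_total += m - 1
        variances.append(ss / (m - 1))
    if rule == "zero":
        fill = 0.0
    elif rule == "pooled":
        if df_total == 0:
            raise ValueError("no cluster has two or more elements")
        fill = ss_total / df_total
    else:
        raise ValueError("unknown rule: " + rule)
    return [fill if v is None else v for v in variances]
```

The third option, collapsing single-element clusters with neighbouring clusters, would be applied to the data before a function of this kind is called.
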


5. SOME DESIRABLE FEATURES OF A VARIANCE ESTIMATION PROGRAM


Francis, Heiberger and Velleman (1975) listed criteria useful in evaluating 
programs in general. In this section, some of the desirable features of a 
computer program for estimating variance from complex surveys will be listed. 
These include user's documentation, input controls, printed output and statis- 
tical effectiveness. These desirable features will be related to those pro- 
vided by SUPER CARP and MINI CARP. 


User's documentation should consist of a manual which basically tells the user 
how to use the program. SUPER CARP and MINI CARP both have manuals which 
explain to the user how to use them. These manuals are structured as 
follows. They contain an introduction which summarizes the various available 
statistical options. Data input and command statements used to specify proce- 
dures, variables and options are explained and examples are provided to illus- 
trate their use. Since data input and command statements are to be entered in 
a specific sequence, a flow diagram is provided. The program procedures are
described in terms of the formulae used, the numerical techniques employed
and some references to the literature.




As stated earlier, the command language used for SUPER CARP and MINI CARP is
in the form of code numbers or alphanumeric codes in fixed card columns. As
pointed out by Francis, Heiberger and Velleman (1975), the most computationally
efficient command languages employ code numbers in fixed card columns.
The disadvantage of this method is that users may need to refer excessively to
the manual to identify the commands. Procedures and options could instead have
been specified through a control statement translator, which would have allowed
English-like commands. The advantage of this input method is that it is
relatively easy to learn. The disadvantage is that the time and effort required
for programming this translator can be prohibitive.


The printed output in SUPER CARP and MINI CARP identifies the statistical
procedure used and labels the variables used in the analysis. Part of the
output gives the program's version number, its name and the date it was last
updated. This identification can be used to trace and fix bugs in the stated
program version. Some informative diagnostic messages are also printed out.
These include messages referring to input controls, such as attempting to read
in more variables than the program has been dimensioned to handle, trying to
read too many clusters, or supplying an improper input format. If some strata
contain one cluster, the program will print out a list of such strata. If the
user requests collapsing of single-cluster strata, the resulting strata will be
printed out.


SUPER CARP and MINI CARP are written in FORTRAN and in double precision. They
can be run on installations that have a FORTRAN compiler, with minor
modifications to the job control language. They can both be extended to
accommodate new statistical procedures. These can be placed in the program in
the form of new subroutines which can be connected to existing software in the
program.
TABLE 9b. Percentage distribution of households by household size, type of response, type of area and number of months
in the sample, 1980 annual average, Canada.


UAR UNAR UNAR Total
urban rural


Number Household Respondent Non-respondent Respondent Non-respondent Respondent Non-respondent Respondent Non-respondent
of months size households households households households households households households households
3 17.4 We 7 (ow to 18.4 16.4 11.5
4 18.7 Vo? 194 Doe) 21.3 16.3 9.1 
5+ 12.8 5.4 14.7 8.2 20.8 2.2 GreZ 
Total 100.0 100.0 100.9 100.0 100.0 100.0 100.0 
3 1 21.1 44.2 18.5 Bre es 26.0 41.5 
Z 30.1 DEYCE SORT 34.5 28.2 Byles) B30) 
3 WH AS 8.7 Nod BreZ 18.4 15.2 9.6 
4 18.5 Blo) 19.3 8.2 21.4 N62 9.4 
5+ 12.8 5.4 14.4 a0 20.6 ftivent 6.5 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 
4 1 21.4 40.4 18.7 Say) 11.7 Zora 37.6 
2 30.2 31.0 30.1 33.1 28.1 34.3 31.7 
3 17.4 11.8 ou VZiee 18.4 14.3 ASS: 
4 1823 10.5 19.2 iloal 21.4 15.9 11.0 
5+ WaT 6.3 14.4 8.0 20.5 10.4 7.1 
Total 100.0 100.0 100.0 100.0 100.0 100 .0 100.0 
5 1 21.6 5D 62 18 .8 SYA lew 24.7 36.5 
2 30.2 32.6 30.2 34.7 28.2 33.8 32.6 
3 17.4 1057 Titi: 13.8 18.3 {SiSe) 12.0 
4 18.2 Jo7 19.0 14.4 21.6 iM 10.9 
5+ 12.6 7.9 14.4 6.0 20.3 10.5 8.0 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 
6 1 2a, 37.0 18.9 28.7 11.9 23.6 19.8 34.1 
2 30.2 31.4 30.1 40.0 28.2 33.6 29 .8 32.4 
3 17.5 11.7 17.6 17.1 18 .3 13.6 17.7 12.5 
4 18.2 11.7 19/52 12.9 Lie) 18.2 18.8 12.7 
5+ 12.4 8.2 14.3 183 20.1 11.0 13.9 8.5 
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 


Non-respondent
households


TABLE 9a. Percentage distribution of households by household size, type of response, type of area and number of months
in the sample, 1980 annual average, Canada.
UAR UNAR UNAR Total
urban rural
Number Household Respondent Non-respondent Respondent Non-respondent Respondent Non-respondent Respondent
of months size households households households households households households households
1 1 19 00 397 17.8 30.9 10.3 7751 _ 
Zz 29.4 Baie 29 .6 32.8 27.6 34.0 
3 18.3 W233 WES 14.7 137 16.1 
TABLE 8a. Percentage distribution of persons by labour force status, by number of
months in the sample, household size and type of response,
1980 and 1981 annual averages, Canada


Number of months in the sample


1 2 3 4 5 6
Household Type of
size response O C I O C I O C I O C I O C I O C I
TABLE 5. Percentage distributions of persons by age group and type of response, 1980 and 1981
annual averages, Canada
TABLE 3. Percentage distribution of respondent and non-respondent households by household size and number
of months in the sample, 1980 and 1981 annual averages, Canada


Household size


Number 1980 1981
of months

1 2 3 4 5+ Total 1 2 3 4 5+
Respondent households
1 17.4 29.1 18.3 19.5 (lads 100.0 187.3 DSS Wied 12oS) 14.7 
Z 18.1 JS) 18 «2 S| 15.4 100.0 19.0 29 .8 i] Toads 572 14.3 
3 18.4 29.2 18.1 1951 13.2 100.0 az 29.7 17.8 19.0 14.3 
4 18.5 27). 18.2 19.1 Ded 100.0 19.5 29.7 17.8 18.9 14.2 
5 18.7 29.1 18.1 19.0 15.0 100.0 ISy/ WE) 9 17.7 18.9 14.0 
6 (shoe) UDP 18.1 1250) 14.9 100.0 19.8 29.8 17.7 18.8 13739 
Non-respondent households
{ Graal 32.3 13.2 Ue? Uo 100.0 Silat 3239 12.8 10.3 6.3 
2 38.9 32.7 VW. De 6.7 100.0 40.5 32.7 lakes 9.1 6.2 
BS Se) 52.2 (ae) 11.8 6.6 100.0 41.5 33.0 9.6 9.4 6.5 
4 34.9 Bo 02 13.4 1 Nici Gre 9 100.0 37.6 Silay 12.5 11.0 7.1 
5 Bis Day) 13.8 12> 6.9 100.0 36.5 32.6 12.0 10.9 8.0 
6 Bal 33.0 14.9 12.4 asd) 100.0 34.1 32.4 U/58) 12.7 855 
Non-response rate
1 15.39 7.64 5.10 4.10 3.30 6.94 12.81 7.34 4.85 3.63 2.97 
2 7.90 4.28 2.59 1.97 eval 3.84 7.03 3.74 DiilD) 1.65 1.51 
3 6.56 3.81 2.61 2.17 1.54 3.47 6.19 3.28 1.62 1.49 V37 
4 6.352 3.92 2.56 2015 1.61 3.45 532 3.02 2.01 1.67 1.44 
5 DG, Dis, 2.63 2.28 1.61 3.42 4.50 Ded 2 1.70 1.45 1.43 
6 5.05 3.41 2.51 2.00 1.59 3.03 3.82 2.45 1.58 1.53 Teod 
Total 7.48 4.50 3.00 2.44 1.89 4.02 6.58 3.16 2.33 1.90 1.69 
TABLE 2. Percentage distributions and non-response rates of respondent and non-respondent
households by household size, 1980 and 1981 annual averages, Canada
Total Respondents Non- Non-response Total Respondents Non- Non-response
respondents rate respondents rate
Household
size
1 19;- 0, AB 05 35.4 7.48 19.9 ee 38.1 6.58 
Z DD Di aig LP oie 32.8 4.50 29.8 29.7 SENS 3-16 
3 17.8 18.2 13.4 3.00 17.6 17.8 (ie. gS) 
4 183° = 19.1 11.4 2.44 18.8 174 10.4 1.90 
5+ (EE lear 7.0 1.89 14.0 14.2 6.9 1.69 
Total 100.0 100.0 100.0 4.02 100 .0 100.0 100.0 3.43 
Average
household size 2.91 UD) 226 2.86 2.88 La
Table 5 Average degree of dependence of the separate sample-dependent estimator
on the synthetic component: census divisions of Nova Scotia


Census      Dependence (1-δ) on the      Proportion of the census division population
division    synthetic component
            K0=0.5     K0=1.0            Partial strata     Complete strata

201         .04        Sue               1.00               -
202         .15        .20               .70                .30
203         .01        .12               1.00               -
204         eZ         .22               1.00               -
205         .04        .14               1.00               -
206         .03        .08               .37                .63
207         .05        .07               .26                .74
210         .06        .09               .35                .65
211         .04        .05               .13                .87
212         .06        .10               .52                .48
213         .03        .16               1.00               -
214         .04        .16               1.00               -
215         .11        .21               1.00               -
216         .05        .15               1.00               -
217         .01        .01               .03                .97
218         .18        .28               1.00               -


Table 4 CEFs for which the bias of the separate synthetic estimator
for the unemployed exceeds 10%


Category (i) Category (ii) Category (iii)
Rel. bias % Rel. bias % Rel. bias %
Up-to-date Out-of-date Up-to-date Out-of-date Up-to-date Out-of-date
CEF data data CEF data data CEF data data
102 1225 -3.90 414 (hee Wi 17.78 301 (AEE 2.209 
104 -12.52 6.23 426 -7.38 -11.78 304 8-11.29 -17.57 
411 -13.10 -0.37 450 1393 -11.35 412 -10.07 -15.80 
436 10.48 -0.48 455 2.67 15.46 438 W245 14.22 
474 18.35 -6.04 460 -7.74 20.80 501 10.52 16.83 & 
806 joa ley -0.20 504 Tog 41.05 O72 15.50 10.59 | 
527 vue) 11.76 818 10.83 39 41 
605 3.63 14.29 
701 4.90 17.74 
804 0.78 15.85 
813 -0.48 -15.46 
Category (i): the bias exceeds 10% only when the auxiliary information is up to date.


(ii): the bias exceeds 10% only when the auxiliary information is out of date.


(iii): the bias exceeds 10% in both cases.


Table 3 Average absolute relative bias (%) of synthetic estimators for CEFs (auxiliary
information consisting of out-of-date data on the population aged 15 and over)


Characteristic Employed Unemployed

Level of

construction Separate Combined Separate Combined
Province
Newfoundland 1.20 Zoe Toi De Wh
Prince Edward Island O54 DOL Ber 2.80
Nova Scotia 0.87 1.64 (ewes: 1ele\7
New Brunswick Deo 2.54 6.59 7.04
Quebec ZD5 3.87 3.87 Ail
Ontario Ze ld, ep ov a 3.01 4.25
Manitoba 1.42 2.28 2.22 3.44
Saskatchewan 2.41 2.65 4.35 4.85
Alberta 5-24 7.08 Dsilz 10233
British Columbia Tee Bar| Lene 4.20


Table 2 Efficiency of small area estimators relative to the direct estimator -
census divisions of Nova Scotia (out-of-date auxiliary data)


Characteristic: Employed; Unemployed
Auxiliary variable: Population; Population by age/sex; Population; Population by age/sex
Level of construction: combined; combined; "

ESTIMATOR
Post-stratified domain: 347 3.60
Synthetic: 4.73 4.73
Ze Zeki
Sample-dependent (K0=1.0): 1.68 1.69


Table 1 Efficiency of small area estimators relative to the direct estimator -
census divisions of Nova Scotia (up-to-date auxiliary data)


ESTIMATOR
Post-stratified
Auxiliary Level of domain Composite Sample-dependent
Characteristic variable construction estimator Synthetic ( & ¢0.223) K0=0.5 K0=1.0
Employed Dwellings combined 4.58 10.17 10692 es WT | TO saz
Population " 4.92 10.75 10.58 10.50 Wer
Population 5.08 10.83 Taye ical ht Vay)
by age/sex "
Dwellings separate Bel 2 10.50 - eG 10.50
Population a Peers) 10% 92 - 10.58 11.42
Population 2.83 11.00 - 11.00 (earls
by age/sex "
Unemployed Dwellings combined 1.33 1.76 Hol! 1.40 keDD
Population 1.36 1.70 Vet> 1.43 1.58
Population 156 1700) Ve d2 1.43 1.58
by age/sex "
Dwellings separate 1.30 1.69 - 1.48 hoo
Population iN eee 1.69 - (econ YS
Population sos 1.69 ~ ‘eso | 1.61
by age/sex "
SURVEY METHODOLOGY 1983, Vol. 9 No. 1


THE REDESIGN OF A SURVEY TO MEASURE COMMODITY ORIGIN AND DESTINATION 
MOVEMENTS BY THE FOR-HIRE TRUCKING INDUSTRY IN CANADA¹


Robert Lussier and Steven Mozes 


This paper firstly provides an overview of the For-hire Trucking 
Survey background and of the steps that were involved in the revi- 
sion that led to its re-design. It secondly describes the general 
direction of the methodology of the re-designed survey which is 
being implemented for reference year 1981. 


1. INTRODUCTION 


The For-hire Trucking Survey was initiated by Statistics Canada in 1971 to
measure commodity origin and destination movements by the for-hire trucking
industry in Canada. For the purpose of the survey, this industry was defined
as the sum of trucking establishments engaged in transportation of freight for
compensation. The survey was a probability sample survey of shipments
recorded on the shipping documents retained by Canada's for-hire trucking
firms. Since 1971, the demand for more reliable and more detailed information 
has been increasing steadily. This increased demand can be attributed to a 
number of factors such as the dramatic growth of trucking activity since the 
early fifties; the increased sophistication of users of transportation 
statistics; the growing interest in the subject of economic regulation versus 
deregulation, and finally the increased market share of trucking, at the 
expense of other modes of transport, within the overall freight transportation 
market. In the late 1970's, Statistics Canada in cooperation with major users 


embarked on a complete revision of the survey. 


—— 


¹ This is a revised version of an invited paper presented at the Joint
Statistical Meetings of the American Statistical Association, the Biometric 
Society, ENAR and WNAR, and the Institute of Mathematical Statistics, 
Cincinnati, August 16-19, 1982. 


Robert Lussier, Business Survey Methods Division and Steven Mozes, 
Transportation and Communications Division, both of Statistics Canada. 
It is the intention of this paper to serve two purposes. Firstly, it provides
an overview of the background of the survey and a description of the steps in 
the revision process, and secondly, it describes the methodology of the rede- 
signed survey which is being implemented for the reference year 1981. It 
should be noted that the details of the methodology of some phases have not 
yet been finalized; this paper will, however, include descriptions of the 


general direction of the incomplete phases. 


2. BACKGROUND 
2.1 Brief overview of the Canadian for-hire trucking industry 


The Canadian for-hire trucking industry is characterized by a very large num- 
ber of small operators and by a high degree of heterogeneity as manifested by 


the variety of commodities carried, size of operators, and area of operations. 


Small carriers, defined as carriers earning less than $100,000, represent 88%
of the industry measured in terms of numbers; however, they represent only 20%
of the industry when measured in terms of operating revenues. The existence
of this large number of operators, their volatility and their relative
insignificance in terms of revenues led to the decision to exclude them from
the survey population.


Trucking firms are involved in the transportation of widely differing
commodities, requiring different kinds of equipment and operating practices. The
various carrier types (e.g. general freight, bulk petroleum, household goods 
movers, etc.) differ from each other not only in terms of the commodities they 


carry but also in terms of shipment size. 


Heterogeneity can also be illustrated by describing the area of operation.
Some trucking firms operate locally only, others intraprovincially, and some
of the larger carriers in each province as well as internationally.


The combination of these factors has implications for the survey design,
especially for stratification.

2.2 History of Canadian truck origin and destination surveys

(a) The Motor Transport Traffic Survey (1957-1963)


The first attempt to measure truck traffic in Canada was made in 1957 with the 
introduction of the Motor Transport Traffic Survey (MTTS). This survey was a
sample survey of motor vehicles engaged in freight transportation. The survey 
frame was a list of registered motor vehicles. It originated from the motor 
vehicle registration files maintained by the provincial or territorial govern- 
ments. This frame was stratified by type of operation and gross vehicle 


weight. 


The sample size was approximately 10% of all registered vehicles. The sample 
was selected in four quarterly segments with approximately one fourth of the 
total sample selected each quarter. Each quarterly sample was spread over
three survey weeks with one third of the sample being used for a seven day 


period per month. 


As the survey was conducted on a vehicle basis, no information was requested
regarding the detailed origin and destination of the commodities carried. It
was a truck origin and destination survey; commodities had secondary
importance. Data relating to the vehicle, such as the description of the
vehicle, miles travelled, fuel consumed and the operating cost associated with
the vehicle, were also collected.


The survey was in operation from 1957 to 1963 inclusive. It was discontinued 
in 1964 because of changes in vehicle licensing systems, structural changes in 
the industry, and most importantly because of a very serious deterioration in


response rates. 


(b) The For-hire Trucking Survey (1969-1979) 


Initial work on a survey to measure domestic intercity origin and destination 
traffic movements of goods by the total Canadian for-hire trucking industry 
began in 1969. At that time, a study of various methods of collecting 
commodity origin and destination statistics was carried out. The study 
results showed that a sample survey of the carriers’ administrative records, 
namely their shipping documents, was a viable approach to the collection of 


the required data. 


In 1970, a pilot survey was conducted to assess the effectiveness of the sur- 
vey approach. The pilot survey involved the examination of the shipping docu- 


ments of 187 for-hire trucking firms throughout the country. 


The favourable response to the pilot survey and the availability of origin, 
destination, commodity, weight and revenue information on the shipping docu- 


ments indicated that the survey approach was feasible. 


For-hire trucking surveys were therefore conducted for reference years 1970 
and 1971 with the above-mentioned objective. For reference year 1972, the ob- 
jective was modified to restrict the survey to Canadian domiciled for-hire 
carriers earning $100,000 or more annually from inter-city trucking. For 
reference year 1973, an updated and better-defined frame of regulated motor 
carriers was used and a more effective sampling procedure was developed. 
Since reference year 1973, the survey has been conducted and the results 
published on an annual basis by the Transportation and Communications Division 


of Statistics Canada [1] [5]. 


3. REVISION OF THE FOR-HIRE TRUCKING SURVEY 


The revision process consisted of two main phases. Firstly, a critical re- 
view of the existing survey was initiated. Secondly, on the basis of the re- 


commendations made during the review process, a complete survey redesign was 


undertaken. 

3.1 Survey review 

(a) Reasons for the review 
In early 1978, Statistics Canada initiated a review of the For-hire Trucking
Survey for the following reasons. First, it has been the policy of the 
Transportation and Communications Division of Statistics Canada to conduct a 
periodic review of each of the ongoing surveys. The For-hire Trucking 
Survey had not been reviewed since 1973. Secondly, the current and anticipa- 
ted future needs of users for increased details for commodity origin and 
destination statistics could not be satisfied within the constraints of the 
Survey. Thirdly, experience gained during the undertaking of the For-hire 
Trucking Surveys and other related surveys provided additional information 
upon which the frame, the stratification variables and the imputation techni- 
ques could be improved. Fourthly, some developments in the trucking industry, 
such as the availability of origin and destination information in machine 
readable format, led to the belief that computer tapes could be utilized to


increase the data base and reduce reporting burden at the same time. 


In addition, increased sophistication of users required the development of im- 
proved data dissemination procedures while changes in data processing techno- 
logy had made the present data processing system not only obsolete from a 


technical point of view but cost inefficient as well. 
(b) Phases of the review 
The survey review was originally organized into two parts, namely Phase I and


Phase II. 


The objective of Phase I was to outline recommendations which concentrated on 
improvements to the survey within its existing framework using only limited 


additional resources. The recommendations had to focus, and indeed did focus,
on a redefinition of the survey population, improvements in stratification
variables and an increase in the sample size of shipping documents. The
recommendations were presented in a report [2].


The objectives of Phase II were to assess the Phase I recommendations from a 
user point of view, to present various cost and implementation alternatives 
for the accepted recommendations and to complete further survey analyses. 
Phase II reformulated some of the Phase I recommendations and added additional 
recommendations which aimed at a smaller population of firms better stratified 
into more homogeneous groups. In addition to recommending the implementation of
these recommendations, four alternatives for increasing the sample size were 
considered namely, the status quo; an increase of 50% to the sample of ship- 
ping documents; an increase of 100% to the sample of shipping documents; and 
finally an increase of 25% to the sample of shipping documents together with 
the processing of available carrier data tapes for presumably 40 or so firms. 
Based on an assessment of the advantages and costs of each of these alternati- 
ves, the latter one was approved in principle because it offered the potential 
for substantial sample size increases with a minimum of cost and data collec- 
tion burden. The recommendations and the supporting details were tabled in a 


report [3]. 


A preliminary assessment of the impact of the recommendations revealed that 
further work was necessary especially to determine the full costs for the use 
of carrier waybill tapes. Therefore a Phase III was added to the survey review
process. In general, its terms of reference were to conduct the investiga- 
tions required to formulate and recommend general specifications for a revised 
survey. The investigations had to follow the recommendations of the Phase II 


review. 


In June 1980, Phase III proposed that the survey be redesigned to accept four
types of input, namely: tapes from selected respondents; transcriptions from
samples of shipping documents drawn from each Document Storage Location Point
(D.S.L.P.) having over $1.5 million intercity domestic revenue annually;
transcriptions from samples of shipping documents drawn from a sample of
D.S.L.P.'s having between $350,000 and $1.5 million intercity domestic revenue
annually; and macro information from D.S.L.P.'s with annual intercity domestic
revenue between $100,000 and $350,000. The decision to collect macro
information from the smaller carriers was based on the fact that these firms do
not keep the documentation needed for sampling purposes.
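
The revenue classes proposed in Phase III can be summarized as a simple assignment rule. The sketch below is illustrative only: the function name, the labels and the `provides_tape` flag are hypothetical, and the treatment of revenues falling exactly on a boundary is an assumption, since the text does not say which class includes the end points.

```python
def input_type(revenue, provides_tape=False):
    """Assign a D.S.L.P. to one of the Phase III input types.

    revenue: annual intercity domestic revenue in dollars.
    provides_tape: True for selected respondents supplying tapes.
    Boundary handling (>= versus >) is an assumption of this sketch.
    """
    if revenue < 100_000:
        return "out of scope"
    if provides_tape:
        return "tape"
    if revenue > 1_500_000:
        # Every such D.S.L.P. is taken; shipping documents are sampled within it.
        return "documents, every D.S.L.P."
    if revenue >= 350_000:
        # A sample of D.S.L.P.'s is taken; documents are sampled within each.
        return "documents, sample of D.S.L.P.'s"
    return "macro information"
```

For example, a D.S.L.P. with $200,000 of revenue would report macro information, while one with $2 million and no tape would have its shipping documents sampled.
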
3.2 Survey redesign 
(a) Objective of the redesign 
After the completion of Phase III of the survey review process, it was decided
to carry out a complete redesign of all aspects of the survey. The objective
of the redesign was to provide more reliable and more detailed commodity
origin and destination statistics relating to the Canadian for-hire trucking
industry. It was expected that both the reliability and the amount of
regional and commodity detail available could be increased when compared with
the "old survey".
(b) Constraints on the redesign 
The main constraints imposed on the redesign were: that the survey population
exclude some types of for-hire trucking firms, namely own account household
goods carriers and oil field carriers; that the stratification be improved to
be more in line with the economic structure of the industry; that three types
of input be accepted, namely tapes from selected respondents, transcriptions
from samples of shipping documents drawn from D.S.L.P.'s of a sample of firms
having more than $350,000 intercity domestic revenue annually, and
macro-information from a sample of firms having annual intercity domestic
revenue between $100,000 and $350,000; and finally that the redesigned survey
be implemented for the reference year 1981, with data collection starting in
the spring of 1982.
4. POPULATION AND FRAME


The population of the survey covers all shipments made during the reference 
year by those trucking firms which are defined as in scope for the survey. A 
shipment is defined as a quantity of merchandise transported by the carrier 
from one person or organization to another person or organization. The 
in-scope firms include those which earn more than $100,000 annually from 
intercity freight transportation, whose main activity is trucking and who are 
Canadian domiciled. Excluded from this population are the shipments of 
certain types of specialized carriers such as the oilfield carriers and own 


account household goods movers.


However, this ideal population is not accessible. As a substitute, firms are 
used as natural clusters of shipments for the first-stage sampling units of 


the design. 


For this reason, the frame consists of a list of all firms which have domestic 
intercity revenue over $100,000. Firms may further be segregated into 
D.S.L.P's. This is the case for those firms whose shipping documents are not 
stored at a central location. The frame is derived from an annual census sur- 
vey of for-hire trucking conducted by Statistics Canada, the Motor Carriers- 


Freight and Household Goods Movers Survey.


5. ULTIMATE SAMPLING UNIT 


The survey accepts three different inputs, namely tapes from selected
carriers, transcribed information from sampled shipping documents and finally
macro information from carriers earning between $100,000 and $350,000 annually.


The Motor Carriers-Freight and Household Goods Movers Survey of Statistics
Canada is an annual census survey of trucking establishments. Its objective
is to obtain establishment-oriented input-output data such as revenues,
expenses, balance sheet information and equipment operated.


The tapes contain information relating to individual shipments, the
characteristics of which are the same as those which are recorded on shipping
documents. Therefore, the ultimate sampling unit for those firms which either
provide tapes or whose shipping documents are sampled is the shipment. For
the carriers in the $100,000 to $350,000 range, macro information is obtained
as these firms do not usually keep the necessary documentation relating to
shipments. For these carriers, the ultimate sampling unit is the firm.


6. INFORMATION COLLECTED 


The principal characteristics needed from each shipment sampled from carriers
earning more than $350,000 intercity domestic revenue annually are the true
origin and the final destination; the description of the commodity(ies)
carried; the weight and the unit of weight; the transportation revenue earned;
and the interlined shipment information. Interlining occurs when a consignment
is moved by a carrier to an intermediate point and then moved by another carrier
to another point. The interlined shipment information is used to eliminate
duplications.


The secondary characteristics needed are the date of shipment; the quantity of
commodity and the unit of measurement (e.g. 5 board feet, 20 gallons, sacks);
some information regarding the shipment weight transcribed (e.g. minimum
weight, convenient weight used for calculating revenue); the rate charged and
the rate condition codes (e.g. a code indicating where the rate is a minimum,
per 100 lb., per hour); and the revenue condition codes (e.g. a code
indicating where exact transportation revenue is not available, where the
shipment is out-of-scope).


The macro information collected from the smaller carriers describes the
average or typical shipments in terms of originating province, destination
province, commodity, average revenue, average weight and number of shipments.


7. ADMINISTRATIVE RESTRICTIONS


The amount of resources available for data collection and processing and the
goal to reduce the burden imposed on the respondent put a limit on the number
of firms selected and on the number of shipments selected and transcribed.


7.1 Maximum number of firms in the sample


The population of the 1981 For-hire Trucking Survey* consists of 2,711 firms
of which 1,288 earn more than $350,000 annually while 1,423 earn between 
$100,000 and $350,000 annually. 


As data collection is very expensive due to the very high cost associated with
travelling to remote areas, efforts are being made to limit the number of
D.S.L.P.'s selected in the sample from those carriers earning over $350,000
annually. The limit is set at 875 D.S.L.P.'s per year, which has been the
historical number during the last ten years of the old survey.


7.2 Maximum total number of transcriptions 


The second administrative restriction relates to the total number of
transcriptions. The present budget allocation allows a maximum of 418,000
transcriptions. This number may vary from year to year depending on
negotiations between Statistics Canada and users who are also cofinancers of
the survey.


7.3 Maximum number of transcriptions per firm


There is also an administrative restriction which relates to the maximum
number of transcriptions per firm. There is an implicit limit imposed on the
number of days the data collection team can spend at any particular location,
so that the respondents are not burdened by the presence of the Statistics
Canada regional operations personnel.


* 1981 For-hire Trucking Survey means the survey conducted in 1982 for
reference year 1981.


8. STRATIFICATION AND SAMPLE ALLOCATION


Using the results of the previous year's Motor Carriers-Freight and Household
Goods Movers Survey, the firms are stratified according to their in-scope
transportation revenue, type of operation and area of operation. These
variables were chosen because they characterize the heterogeneity of the
industry. The in-scope transportation revenue indicates whether the firm is a
Class 1, a Class 2 or a Class 3 firm, i.e. whether the firm earned $2.7
million or more, between $350,000 and $2.7 million, or between $100,000 and
$349,999 of revenue respectively from Canadian intercity non-armoured and
non-household goods freight transport. The type of operation characterizes the
firms as specializing in general freight small shipments, general freight
large shipments, automobiles, liquid petroleum, dump trucking, forest
products, building materials, dry bulk and/or refrigerated liquids, heavy
machinery, refrigerated solids, explosive and/or other dangerous goods,
agricultural products, animals and van lines. The general freight small
shipment carriers are general freight carriers for which the average revenue
per shipment is less than $85.00; the general freight large shipment carriers
are the rest of the general freight carriers. The area of operation indicates
the specific Canadian province, Yukon or Northwest Territories, or the
combination of these, in which the firm operates. For example, an area of
operation could be New Brunswick, meaning that the firm operates in New
Brunswick only. Another example would be Atlantic, which means that the firm
operates in 2 or more of Newfoundland, Prince Edward Island, Nova Scotia and
New Brunswick but nowhere else in Canada. There are 20 of these areas of
operation.


The dollar cut-offs used in the stratification by revenue and by type (i.e.
$85, $350,000 and $2.7 million) are flexible and may vary in the years to come
depending on the changes occurring in the population.


The above stratification creates 840 strata of which 355 were non-empty in the
1981 For-hire Trucking Survey.


Once the frame is stratified, subject matter officers may identify take-all
firms, i.e. firms that they want to be included in the sample with probability
one. Next, a methodologist determines the number of firms to be selected
among the non take-all firms in each stratum. To do so, he goes through
several steps from which the take-all firms are excluded.


First, a computer program calculates the initial number of firms to be
selected in each stratum to meet a target coefficient of variation of the
estimate of in-scope revenue in the stratum. This target coefficient of
variation is the coefficient that one would like to obtain if the estimate
were calculated using the reported total revenue from a sample of firms
selected using simple random sampling from a population of firms for which the
distribution of the in-scope revenue is the same as the distribution of the
in-scope revenue of the previous year's Motor Carriers-Freight and Household
Goods Movers Survey. The formula is:


where

${}_1 n_h$ : initial number of firms to be selected among the non take-all
             firms in stratum h;
$N_h$ : number of non take-all firms in stratum h;
$X_h$ : total in-scope revenue of the non take-all firms in stratum h;
$\sigma_h^2$ : variance of the in-scope revenue of the non take-all firms in
               stratum h; and
$C.V._h$ : target coefficient of variation in stratum h (the value used is
           the same for all strata of a given class but may vary from class
           to class).


Secondly, the initial sample sizes are revised to ensure that a minimum number
of firms is selected from each stratum, i.e.

$${}_2 n_h = \min\{\max(m,\ {}_1 n_h),\ N_h\}$$

where

${}_2 n_h$ : revised initial number of firms to be selected among the non
             take-all firms in stratum h; and
$m$ : minimum number of firms to be selected in stratum h, if possible⁴.


Then the revised initial sample sizes are summed over the strata to get a
total revised initial sample size.


Next, the sample sizes are again reviewed to ensure that the sample size in a
given stratum is greater than or equal to the sample size that one would have
obtained by distributing the total revised initial sample size of a class
across the strata of the class proportionally to the square root of the number
of firms in each stratum, i.e.

$${}_3 n_h = \max\left\{ {}_2 n_h,\ \frac{\sqrt{N_h}}{\sum_k \sqrt{N_k}} \sum_k {}_2 n_k \right\}$$

where the summation is done over all strata k of the same class as stratum h.


Finally, the survey manager may subjectively adjust the sample sizes to
${}_4 n_h$.
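
The allocation steps described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the step-one expression
used here is the textbook sample size for a target coefficient of variation of
an estimated total under simple random sampling without replacement, which may
differ from the formula actually programmed; the function names, the rounding
up, and the capping of the step-three result at the stratum size are all
hypothetical choices.

```python
import math

def initial_size(N_h, X_h, var_h, cv_h):
    """Step 1 (assumed textbook formula): SRSWOR sample size for which the
    estimated stratum total has coefficient of variation cv_h."""
    if var_h == 0:
        return 0
    return math.ceil(N_h ** 2 * var_h / (cv_h ** 2 * X_h ** 2 + N_h * var_h))

def revised_size(n1_h, N_h, m=3):
    """Step 2: take at least m firms, but never more than the stratum holds."""
    return min(max(m, n1_h), N_h)

def sqrt_floor(n2, N):
    """Step 3: raise each stratum of a class to its share of the class total
    when that total is spread proportionally to sqrt(N_h)."""
    total = sum(n2)
    root_sum = sum(math.sqrt(N_h) for N_h in N)
    shares = [math.ceil(total * math.sqrt(N_h) / root_sum) for N_h in N]
    # Capping at N_h is an assumption; the text does not cover this case.
    return [min(max(a, s), N_h) for a, s, N_h in zip(n2, shares, N)]

# Three strata of one class with 100, 25 and 4 non take-all firms.
print(sqrt_floor([20, 3, 3], [100, 25, 4]))
```

The default `m=3` mirrors the minimum reported for reference year 1981. The
final subjective adjustment by the survey manager is not modelled.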


The above sample allocation method has been retained because it gave
satisfactory results during the testing phase and because it makes use of the
only variable that was available for all firms, namely the in-scope revenue of
the firms. Nevertheless, it should be realised that the in-scope revenues of
the firms are not collected directly in the For-hire Trucking Survey; instead,
revenues from a sample of shipments are collected. Therefore the above method
of sample allocation ignores completely the second stage of sampling.


⁴ For reference year 1981, this minimum was set to 3 for all strata.


9. FIRST STAGE SAMPLE DESIGN


The first stage consists of selecting in each stratum a number of firms
corresponding to the number of firms ${}_4 n_h$ determined at the sample
allocation stage.


All firms earning $2.7 million or more of in-scope transportation revenue were
made take-all, i.e. were selected with probability one in the 1981 For-hire
Trucking Survey. The reason for this approach is that these firms are known
to be heterogeneous with respect to the principal statistics to be estimated
and are known to contribute a large proportion of the revenue figures to be
estimated.


The sample of firms is finally converted to a sample of D.S.L.P.'s by
including in the latter sample all D.S.L.P.'s of the selected firms.


10. SECOND STAGE SAMPLE DESIGN 


The second stage of the sample design for D.S.L.P.'s of Class 1 and Class 2
firms consists of selecting a systematic sample of shipments from the files of
each selected D.S.L.P. This selection is done by Statistics Canada Regional
Operations Division interviewers at the D.S.L.P. The sampling intervals used
differ depending on the number of shipments carried by the firms. They are
generally obtained from a table provided to the interviewers. This table
gives various file size ranges with their corresponding sampling intervals.
However, the sampling intervals may be pre-determined for any given firm by
Statistics Canada Head Office staff. This is especially the case for
multi-D.S.L.P. firms because the interviewer at a given D.S.L.P. may not know
how many shipments were carried by the firm as a whole. This is also the case
for firms having special characteristics, such as firms carrying dangerous
goods, and others for which the survey manager may want a larger data base.
In subsequent years, this may also be the case for firms contributing to
domains where the reliability of the estimates in the previous year was lower
or higher than what was desired.


For D.S.L.P.'s of Class 3 firms, there is no second stage sampling design.
Individual shipments are not selected from the files of the D.S.L.P.'s.
Instead, aggregated data are collected at the D.S.L.P. level.


11. FIELD OPERATIONS 


The field operations are different for class 1 and class 2 firms than for
class 3 firms. For class 1 and class 2 firms, the operations consist of
selecting shipments from the files of their D.S.L.P.'s and of transcribing the
characteristics of the selected shipments on coding sheets. For class 3
firms, they consist of obtaining aggregated data over the telephone about
their trucking operations.


This section discusses the activities that involve the Statistics Canada
Regional Operations staff, namely the training of the Regional Operations
project managers, the planning of the collection, the collection at the
D.S.L.P.'s of class 1 and class 2 firms, the collection from the class 3 firms
and finally the profiling of class 1 and 2 D.S.L.P.'s.


11.1 Training of the regional operations project managers


Every year, the Statistics Canada regional operations project managers are
trained on all aspects of the survey. The training session is four days long
and is conducted during the month of March. It is broken down into two
components: an in-class training and an on-the-job training. The in-class
training consists of a series of talks and exercises given by the survey
project manager and the methodologist(s). The on-the-job training consists of
having groups of three to four people visit a D.S.L.P. and apply and discuss
the knowledge acquired during the in-class training.


11.2 Planning of the collection


Having been trained, the regional operations project managers recruit the
interviewers and administer a thorough training program. Then the
interviewers, with the advice of their regional operations project manager,
schedule their work and plan their itineraries for their visits to D.S.L.P.'s
of class 1 and class 2 firms. The itineraries are drawn to avoid unnecessary
travel and to achieve maximum productivity. The interviewers mail to the
D.S.L.P. officials introductory letters which provide a brief explanation of
the survey. Subsequently, the interviewers telephone D.S.L.P. officials for
appointments. The collection of the data takes place between May and
September for the survey covering the previous calendar year.


11.3 Overall description of the collection in the D.S.L.P.'s of Class 1 and
     Class 2 firms


At the time of the appointment, the interviewer conducts an interview with the
D.S.L.P. officials. During the interview, he/she explains the survey,
describes the uses of the data, estimates the time required to do the work and
asks for information about the firm. This information concerns mainly
revisions to the names and addresses, changes of ownership, type(s) of
document and filing system used and aggregated data about the operations of
the D.S.L.P. during the reference year.


The most common types of shipping documents are the probills, bills of lading,
load manifests, trip reports, and invoices. A firm may use any combination of
these.


The types of filing system include: in complete numeric sequence; in broken
numeric sequence; in chronological order; in alphabetical order (e.g. by
customer name); by terminals; by commodity type; or in no order at all. The
documents may even be cross-filed; for example, by serial number and by
customer's name. Within a filing system, documents may be kept in a set of
file drawers, in sets of binders or shannon files, on shelves, in drawers, or
even in books.


The aggregated data about the operations of the D.S.L.P. cover several
variables among which are the total transportation revenue earned; the total
tonnage carried; the total number of shipments transported; the percentages of
each of these three items represented by intercity shipments and the
percentages represented by international shipments; the types of commodities
carried and the percentage each type represents in the total transportation
revenue.


Often the interviewer has a choice of filing systems which provide information
on the items needed in the survey. The interviewer assesses the completeness
of the various filing systems with regard to the information on the five
principal characteristics and on the reference year, and then chooses the
system having the smallest under-coverage. However, if two or more systems
have the same under-coverage (if any), the interviewer selects the one that
includes the smallest number of out-of-scope records or the one that allows
out-of-scope records to be removed from the file or not to be counted.


Next, the interviewer selects the sample of shipments as follows. Using the
number of shipments reported by the official of the D.S.L.P., he/she gets from
a table the corresponding sampling interval and random start. In some
instances, the interval and the random start may have already been
pre-determined by Statistics Canada Head Office. Next, in a numeric filing
system, he/she adds the random start and/or the interval to the document
numbers to get the selected shipments. Otherwise, he/she has to count a
number of documents equal to the random start and/or to the interval to get
the selected shipments.
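
For a file kept in complete numeric sequence, the selection rule can be
illustrated as follows. This is a minimal sketch, not the interviewers'
actual procedure: the function name is hypothetical, and it assumes the
random start is counted from 1, so that a start of r selects the r-th
document in the file.

```python
def systematic_sample(first_doc, last_doc, interval, random_start):
    """Document numbers selected from a complete numeric sequence.

    Assumes random_start is between 1 and interval, counted from the first
    document in the file (an assumption made for this illustration).
    """
    if not 1 <= random_start <= interval:
        raise ValueError("random start must lie between 1 and the interval")
    first_selected = first_doc + random_start - 1
    # Every interval-th document after the first selected one is taken.
    return list(range(first_selected, last_doc + 1, interval))

# A file numbered 1001 to 1050, sampled 1 in 10 with random start 4.
print(systematic_sample(1001, 1050, 10, 4))
```

In a file with no usable numbering, the same start and interval would be
applied by physically counting documents instead of adding to serial numbers.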


Once a shipment is selected, the interviewer transcribes its characteristics.
The transcribing operation is often difficult because it can be hard to
understand the various documents and the coding used on some documents. This
is especially true for the commodity names. The interviewer must avoid the
use of brand names, proper names and names which have more than one meaning.
The interviewer often has to interpret the information on the documents and to
enter the data on the coding sheets in a format that would be accepted by the
computer system.


11.4 Overall description of the collection in the Class 3 firms


The interviewer mails an introductory letter two to three weeks prior to any
attempt to contact the firm by phone. Subsequently, he/she contacts the
official in the firm who is best suited to provide the required information.
This may, however, take several phone calls. The interviewer then conducts an
interview over the phone.


During the interview, he/she will ask questions similar to the questions asked
for class 1 and 2 D.S.L.P.'s. There is a major difference however: no
questions are asked about the types of documents utilized and the filing
systems used by the firm. Once this first part of the interview is completed,
the interviewer proceeds to have the respondent describe his types of
shipments. For each type of shipment, the description is to be made in terms
of province of origin, province of destination and name of commodity carried.
Then the official is asked to report an estimate of the number of shipments,
the average weight and the average revenue of each type of shipment.


It is a general subject matter belief that the operations of any given class 3
firm are fairly homogeneous. Therefore each has only a few types of shipments
to report. The coverage obtained through this approach is also believed to be
acceptable from a user point of view. No testing was done of this hypothesis.


It sometimes happens that a class 1 or 2 D.S.L.P. cannot provide any
documents, does not keep documents suitable for sampling or cannot provide a
portion of the shipping documents and that this portion cannot be represented
by the available documents. The latter may happen for example when the
missing documents represent specific contracts that have been removed for
audit purposes. In these cases, the interviewer has to "profile" the missing
documents. The profiling consists in having a D.S.L.P. official describe the
types of shipments on the missing documents. The profile is similar to the
description of the types of shipments of the class 3 firms with the exception
that the precise origin and the precise destination of the shipments (i.e. the
village, town, city, etc.) are wanted in this case.


The profiling activity can be long in some D.S.L.P.'s because their operations
can be quite extensive. It requires good cooperation from the D.S.L.P.'s
official.


12. DATA PROCESSING 
12.1 Manual processing 


The completed documents are sent to Statistics Canada Head Office in Ottawa.
Upon receipt, the documents are logged in and the identification numbers
verified. Two short tasks are also undertaken at this point.


First, a brief scan is conducted to identify and code closings of D.S.L.P.'s,
deaths of firms, out-of-scope firms and abortions. Out-of-scope firms are
active firms for which the in-scope revenue is nil for the reference year.
Abortions are D.S.L.P.'s for which no information was collected although it
was known that the D.S.L.P. had in-scope revenue for the reference year. As
examples, a firm found in the field to have earned its revenue 100% from local
shipments would be an out-of-scope firm while a single D.S.L.P. firm that
refuses to cooperate or is on strike would be an abortion.


Secondly, the profile data of the class 1 and 2 D.S.L.P.'s are examined to
determine the number of shipments that should have been transcribed for each
reported type of shipment if the documents had been available. These numbers
are determined by performing calculations using the total number of shipments
covered by the profile, the random start and the sampling interval that should
have been used if the documents had been available. These numbers are then
coded so that the computer can generate the required number of transcription
records for each type of shipment as if transcriptions were obtained.
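
The count of records that a systematic sample would have produced can be
worked out from the three quantities named above. The sketch below is only an
illustration: the function name is hypothetical, the random start is assumed
to be counted from 1, and how the resulting count is divided among the
reported types of shipment is not described here, so it is left out.

```python
def expected_transcriptions(total_shipments, random_start, interval):
    """Number of documents a systematic sample would have selected.

    Assumes the documents are numbered 1..total_shipments and that the
    random start is counted from 1 (assumptions for this illustration).
    """
    if interval < 1 or random_start < 1:
        raise ValueError("interval and random start must be positive")
    if random_start > total_shipments:
        return 0
    # Selected positions are random_start, random_start + interval, ...
    return (total_shipments - random_start) // interval + 1

# A profile covering 50 shipments, sampled 1 in 10 with random start 4.
print(expected_transcriptions(50, 4, 10))
```

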


12.2 Data capture


The forms are next sent to data capture. The capture is done on a
mini-computer which allows edits and other processing to be performed on-line.


There are many edits performed on the mini. Some edits generate error
messages and require corrective action; others generate warning messages that
require verification of the entered data and corrective action only if
necessary. Some edits consider the validity of each response individually
while others consider the relationships between valid characteristics of the
same shipment. The operators of the mini-computer are expected to possess
subject matter knowledge to perform corrections on-line. Manual imputations
are performed when necessary because there is no automated imputation
performed for


As part of the other processing, the weight is converted to metric and the
rate to $/100 kilograms. Also, the origin and destination names (i.e.
village, town, etc.), if present, are matched against a municipality library
to obtain a Standard Geographical Classification (S.G.C.) code, a latitude and
a longitude. Whenever there is a nonmatch, the operator is instructed to
enter a synonym. Similarly, the commodity name is matched against a commodity
library to get a 3-digit Standard Commodity Classification (S.C.C.) code.
Whenever there is a nonmatch, the operator uses a synonym or enters "unknown".
There is therefore always an S.C.C. code for each shipment. Also, the mini
generates the required number of transcription records for each type of
shipment from the profile data of the class 1 and 2 D.S.L.P.'s.


Finally, the data are unloaded from the minis and two data sets are created: a
data set of shipments of class 1 and 2 firms and a data set of type-of-shipment
data of class 3 firms. The principal difference between the two data sets is
that the first one is at the shipment level while the second one is at an
aggregated level. Note also that the first one has more variables (e.g. rate,
place of origin rather than province of origin, etc.) than the second one.


12.3 Main frame edits and imputations


A road distance between the origin and the destination of each in-scope class
1 or 2 shipment has to be obtained in order to be able to provide
tonne-kilometres estimates for class 1 and 2 firms. Therefore, the origin
S.G.C. - destination S.G.C. pair of each in-scope class 1 or 2 record is
matched against a distance library to get a road distance in kilometres
between the two locations. Whenever there is a nonmatch, an aerial distance
(X) is calculated using the latitudes and longitudes of the origin and of the
destination. Then X is converted to a road distance Y using the simple linear
regression model

$$Y = aX + b$$

where a and b vary according to 12 regions of origin and 12 regions of
destination. The road distance is assigned to the record.
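
The fallback for a nonmatch can be sketched as follows. The great-circle
(haversine) formula used for the aerial distance is an assumption, since the
method of computing it from the coordinates is not specified; the coefficient
values and the function names are hypothetical and serve only to show how a
pair (a, b) chosen by region of origin and region of destination would be
applied.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch

def aerial_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance X between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def road_distance_km(x, a, b):
    """Apply the linear model Y = aX + b for one origin/destination region pair."""
    return a * x + b

# Hypothetical coefficients for one (origin region, destination region) pair.
coefficients = {("region 1", "region 2"): (1.25, 10.0)}
a, b = coefficients[("region 1", "region 2")]
x = aerial_distance_km(45.42, -75.70, 43.65, -79.38)
print(round(road_distance_km(x, a, b), 1))
```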


Missing data of partially transcribed shipments of class 1 and 2 firms are
also imputed. The imputation technique used depends on the missing variable
or the pair of missing variables. Major imputations are performed using fixed
relationships between reported figures, unit weight conversion factors and
pro-rate tables. An example of a fixed relationship between reported figures
is

$$\text{weight} = \frac{\text{revenue} \times 100}{\text{rate}}$$

This relationship can be used to impute weight when revenue and rate are
present or revenue when weight and rate are present. Unit weight conversion
factors are coefficients determined by unit type (e.g. case, bag, litre, etc.)
by S.C.C. code. Knowing the unit and the S.C.C. code of the commodity, the
proper conversion factor can be applied to the quantity of units to derive the
weight. Finally, pro-rate tables show rates by commodity section, by distance
block and by revenue or weight group. These tables are based on the previous
years' data modified by incoming valid current-year data. The pro-rate tables
are used to calculate the weight when the revenue is present or the revenue
when the weight is present.
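
The fixed relationship between weight, revenue and rate lends itself to a
short illustration. The sketch below assumes the rate is expressed in dollars
per 100 units of weight, as in the relationship above; the function name and
the use of `None` for a missing value are hypothetical, and the conversion
factors and pro-rate tables are not modelled.

```python
def impute_weight_or_revenue(weight=None, revenue=None, rate=None):
    """Fill in weight or revenue from the other one and the rate.

    The rate is assumed to be in dollars per 100 units of weight.
    Returns the pair (weight, revenue); values stay None when the
    relationship cannot be applied.
    """
    if rate is None or rate <= 0:
        return weight, revenue
    if weight is None and revenue is not None:
        weight = revenue * 100 / rate
    elif revenue is None and weight is not None:
        revenue = weight * rate / 100
    return weight, revenue

# Revenue of $50.00 at a rate of $2.50 per 100 units of weight.
print(impute_weight_or_revenue(revenue=50.0, rate=2.5))
```

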


In cases where too many characteristics of a shipment have to be imputed, the
shipment is flagged as not usable.


Expansion edits are subsequently performed. For class 1 and 2 firms, these
edits consist of weighting up crudely the number of shipments transcribed, the
transcribed revenue and the transcribed tonnage and comparing the results to
the total number of shipments, revenue and tonnage provided during the
interview by the D.S.L.P. official. Similar edits are performed for class 3
firms. Discrepancies in both cases are followed up.
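
An expansion edit of this kind might look like the sketch below. It is only
an illustration: the crude weighting is assumed to be multiplication by the
sampling interval, and the 10% tolerance and the function name are
hypothetical, since the criterion for following up a discrepancy is not given.

```python
def expansion_edit(sample_total, sampling_interval, reported_total,
                   tolerance=0.10):
    """Return True when the crudely expanded sample total differs from the
    total reported at the interview by more than the tolerance."""
    expanded = sample_total * sampling_interval
    if reported_total == 0:
        return expanded != 0
    return abs(expanded - reported_total) / reported_total > tolerance

# Transcribed revenue of $4,800 from a 1-in-10 sample against $50,000 reported.
print(expansion_edit(4800.0, 10, 50000.0))
```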


13. ESTIMATION PROCEDURES 


For the estimation procedures, it was decided to consider the second stage
systematic sampling in the class 1 and 2 firms as simple random sampling
without replacement (S.R.S.W.O.R.). This decision was made because first the
documents were considered to be in random order and secondly the use of
S.R.S.W.O.R. allows the computation of an estimate of the sampling variance.


As the first step of the estimation procedures, weights are calculated. There
are first stage and second stage weights for class 1 and 2 records but only
first stage weights for class 3 records. In general, first stage weights
correspond to the inverse of the probability of selecting a D.S.L.P. in its
stratum and second stage weights correspond to the inverse of the probability
of selecting a shipment in its D.S.L.P., supposing S.R.S.W.O.R. was used.
First stage weights are adjusted by computer to reflect the contribution of
abortions. No adjustments are made for the closing of D.S.L.P.'s, deaths of
firms and out-of-scope firms because they are considered as having generated
no shipments. Final weights are attached to each record on the data set of
class 1 and 2 firms and on the data set of class 3 firms.
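
The weight calculation can be illustrated with a small sketch. The form of
the adjustment for abortions is not spelled out, so the ratio adjustment used
here (selected units over responding units within the stratum) is an
assumption, as are the function names.

```python
def first_stage_weight(N_h, n_h, n_aborted=0):
    """Inverse selection probability in stratum h, inflated for abortions.

    The abortion adjustment n_h / (n_h - n_aborted) is an assumed form.
    """
    responding = n_h - n_aborted
    if responding <= 0:
        raise ValueError("no responding units in the stratum")
    return (N_h / n_h) * (n_h / responding)

def second_stage_weight(M_i, m_i):
    """Inverse selection probability of a shipment, treating the systematic
    sample of m_i out of M_i shipments as S.R.S.W.O.R."""
    return M_i / m_i

def final_weight(N_h, n_h, M_i, m_i, n_aborted=0):
    """Final weight of a class 1 or 2 record: product of the two stages."""
    return first_stage_weight(N_h, n_h, n_aborted) * second_stage_weight(M_i, m_i)

# 10 of 40 units selected, 2 of them abortions; 50 of 500 shipments sampled.
print(final_weight(40, 10, 500, 50, n_aborted=2))
```

For class 3 records only the first stage weight would apply, since no
shipments are sampled.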


Detailed diagnostic reports are produced. These reports are tables which
present the data under various aggregates. They are useful tools to analyse
the data and to perform final quality checks.


The data set of class 1 and 2 firms is cleared by discarding out-of-scope
shipments. Some types of out-of-scope shipments are shipments to or from the
U.S.A.; shipments transported 15 miles or less from origin to destination;
shipments which were off-highway; shipments which would be double counted as a
result of interlining between road carriers; shipments which would be double
counted because they were recorded by household goods movers who are van line
agents and by the van lines themselves; shipments which did not generate any
intercity transportation revenue; and records which relate to
non-transportation services such as storage, packing, equipment rental, labour
loading and unloading.


Estimates of revenue, tonnage and tonne-kilometres for the publication are
finally generated by summing the weighted data over the appropriate domains.
Measures of error such as the coefficients of variation are also provided with
the estimates. The coefficients of variation are obtained from the formula
derived from the sample design but supposing the systematic sample of
shipments is a simple random sample of shipments.
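
Summing weighted data over domains can be sketched as below. The record
layout (a dictionary with a `weight` field) and the field names are
hypothetical, and the variance formula behind the coefficients of variation is
not reproduced.

```python
from collections import defaultdict

def domain_totals(records, domain_key, value_key):
    """Weighted total of one variable for each value of the domain variable."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[domain_key]] += rec["weight"] * rec[value_key]
    return dict(totals)

# Hypothetical shipment records carrying their final weights.
shipments = [
    {"origin": "Ontario", "weight": 50.0, "revenue": 120.0},
    {"origin": "Ontario", "weight": 50.0, "revenue": 80.0},
    {"origin": "Quebec", "weight": 20.0, "revenue": 200.0},
]
print(domain_totals(shipments, "origin", "revenue"))
```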


14. USE OF THE DATA AND METHODS OF DATA DISSEMINATION 


14.1 Use of data 


Requests for estimates yielded from the old survey came from a wide variety of
sources. The nature of these requests has also varied a great deal. It is
expected that the nature of the demand for data from the new survey will be
similar to that in the past.


The estimates have been used extensively to satisfy five main requirements,
namely, to measure the volume of domestic trade transported by intercity
for-hire carriers provincially and interprovincially; to measure the rate of
industrial growth reflected by intercity commodity movements; to provide
information on regional development; to assist in transportation studies; and
to support the presentation of briefs, submissions and other inquiries to
regulatory authorities and commissions.


One specific use of the data was to define the characteristics of trucking
markets using variables such as commodities carried, average lengths of haul
and shipment weight. Another specific study examined and analyzed selected
aspects of performance of carriers operating in regulated and unregulated
environments. The cost behaviour of these carriers was examined by using
traffic characteristics such as shipment size and average length of haul.


In the past, special requests for estimates from this survey have come from
sources such as government departments concerned with trade; transport
regulatory officials at both federal and provincial levels; carriers;
university consultants; industry associations; and many other organizations
and individuals who share a common interest in transportation.


14.2 Methods of dissemination 


The redesigned survey will provide information in three modes similar to the
old survey.


First, the publication will present the estimates that are generated by the
regular system of the survey. Measures of error such as the coefficients of
variation will be given with the estimates. Secondly, special requests will
be processed subject to cost and reliability constraints. Finally, the data
base of shipments generated by the survey may be made available on magnetic
tape to selected users subject to constraints of confidentiality.


15. FUTURE WORK 


As mentioned earlier in this paper, the survey accepts three types of input,
one of which is computer tape from selected respondents. This type of input
has been found difficult to handle and, although work has commenced on this
subject, progress so far is disappointing. Extensive negotiations are
required with the firms to obtain the requested data on tape and then a
further analysis is needed to evaluate the data. For reference year 1981,
only one tape will be used, from a firm which handled about 5 million
shipments during that year. More firms will provide data on tape in
subsequent years. Agreements are presently being reached with 5 additional
companies for reference year 1982.


Nevertheless, when a firm's computer tape is finally obtained and found to
meet our requirements, extensive systems manipulation will still be required
to handle the tape. Also, manual interventions will be necessary to handle
nonmatches to the various libraries. Therefore, records will most likely
have to be sampled on each tape using the same second stage sampling design as
for the sampling of documents of class 1 and 2 firms.


Another area where future work is needed is on having firms themselves sample
their documents. As an example, a company could photocopy the pro-bills
ending in a given number when the pro-bills are issued, and send the
photocopies to Statistics Canada on a monthly basis.


Finally, major efforts will be made to evaluate thoroughly the various phases
of the survey and to formulate recommendations for improvements. These
recommendations will hopefully be implemented for the 1982 reference year
survey.


THE METHODOLOGY OF THE CANADIAN AIR SCHEDULED INTERNATIONAL
PASSENGER ORIGIN AND DESTINATION ESTIMATION SYSTEM¹


Greg Hunter and Lisa DiPiétro²


The Air Scheduled International Passenger Origin and Destination
(ASIPOD) estimation system uses the data from two air traffic
surveys to produce origin-destination estimates of international
passengers. The "assignment technique" is the solution to the
problem caused by the non-coverage of non-interlining traffic.
The assumptions of the technique are sufficiently questionable to
warrant an evaluation of the bias of the estimates. However,
major improvements will be made in the new system which will
decrease the bias in the estimates. Also, estimates of
reliability will be produced. As a result, knowledge of the
strength of the inferences made with respect to air traffic
markets from these estimates will be improved in international
bilateral air negotiations.


1. INTRODUCTION 


In 1979, Statistics Canada embarked on a revision of the federal aviation 
statistics program by inviting Transport Canada and the Canadian Transport 
Commission (i.e. the two "user departments") to form an interdepartmental 
revision team. The ASIPOD estimation system is one of several projects in the 


revision program. 


The ASIPOD estimation system uses the data from two air traffic surveys to
produce estimates of the number of passengers on scheduled international
flights between Canadian and foreign markets for various origin-destination
combinations. The first of these two surveys, the revenue passenger origin
and destination survey, provides a sample of origin-destination data on
international journeys with Canadian carriers on at least one leg of the
itinerary. A coverage problem exists, since no data are available on those
international journeys with foreign carriers on all legs of the itinerary.
The second survey, the airport activity survey, counts all passengers
entering or leaving Canada on all Canadian or foreign scheduled carriers
without consideration of the passenger's origin or destination.


¹ Presented at the Joint Statistical Meetings of the American Statistical
  Association in Cincinnati, August 1982.

² Greg Hunter, Business Survey Methods Division, Statistics Canada.


This paper first outlines the requirements users have for international 
passenger origin and destination estimates. Then, the relevant aspects of the 
two air traffic surveys and the non-coverage problem are presented. And, 
finally, the paper describes how the ASIPOD estimation system will produce 

estimates of the number of passengers and the associated coefficients of 
variation for various international origin-destination pairs for the portions 


of the international market both covered and not covered by the first survey. 


2. USER REQUIREMENTS 


The users require estimates of international scheduled commercial air service 


passengers by origin and destination for bilateral air negotiations. 


An international scheduled commercial air service is defined to be an 
operation which is between points in Canada and points in any other country, 
and which provides public transportation of persons, goods or mail by aircraft 
in accordance with a schedule and at a toll or charge per unit of traffic. 


Such a service is referred to as a "unit toll" service. 


Before an international scheduled commercial air service can be operated into
and out of Canada, some form of formal agreement must exist between the
Government of Canada and the government of the second country. The formal
agreement between countries may take the form of an interim diplomatic
exchange of notes or of a complete negotiated Air Transport Agreement.


International bilateral air negotiations involve officials of the Canadian
government from External Affairs, the Canadian Transport Commission, Transport
Canada and the Ministry of Industry, Trade and Commerce. Negotiations may
last several months or several years.


The routes for scheduled air services are normally the major item for 
negotiation, but there are many others. Some of the items, or articles 


written into air transport agreements, may be: 


- rights to fly across, or to make stops for non-traffic purposes in, a
given territory

- designation of the airline to operate each route 

- compliance with laws and regulations of each country, dealing with such 
issues as entry, clearance, immigration, passports, customs and 
quarantine 

- airworthiness, certificates of competency, and licences 

- exchange of statistical information 

- tariffs 

- transfer of funds 


- exemption from taxation of income 


In order to negotiate these items, and particularly for an exchange of routes,
the negotiating officials must know where air traffic markets are and whether
they are growing. Analyses of the costs and benefits to Canada and to
foreign countries of various international routes for Canadian and foreign air
carriers must be available to the negotiating officials. To ensure that
Canada can negotiate a fair market share, the provision of international
passenger origin and destination estimates is a necessity. A crude and
indirect indication of the value of such data to the Canadian economy is that
the revenue³ generated from all international air routes to and from Canada
in 1980 was about 2.3 billion dollars (Canadian).


³ Taken from tabulations internal to the Canadian Transport Commission.



3. NATURE OF THE NON-COVERAGE PROBLEM


An underlying assumption of the redesign is that the same basic methodology, 
as implemented in the existing system, is to be used. As a result, most of 
the feasibility work involved identifying desirable improvements, ranking 
their desirability and determining how much could be done within cost and time 
constraints. This "same basic methodology" provided direction with respect to 
the calculation of estimates of the number of international passengers, but 


not to the calculation of estimates of reliability of the passenger estimates. 


The target population of the international estimation system is the set of all
tickets with an international journey (i.e. a journey between a foreign
country and either Canada or the United States). An exchange program on
passenger O & D data is maintained between Canada and the United States,
whereby the United States gives Canada those records detailing the complete
itineraries of the tickets collected in their survey on which:


(i) a U.S. and Canadian point is shown in the routing, or

(ii) a U.S. carrier is recorded as having flown to or from a Canadian point,
or

(iii) a Canadian carrier is recorded as having flown to or from a U.S. point.


As a result of this exchange agreement, the expressions "foreign" and "foreign
(non-U.S.)" are both used in this paper to denote "neither Canadian nor
American".


The revenue passenger origin and destination survey collects tickets issued
for international journeys, but only major Canadian carriers participate in
the survey. Each participating carrier selects a flight coupon on a ticket
with a serial number ending in '0', if that carrier is the first participating
carrier to fly on a leg of that ticket. Hence, the survey reports a 10
percent sample of unique flight coupons on which there is at least one
participating Canadian or American carrier.


Some information concerning the markets of foreign (non-U.S.) carriers is 
obtained from the revenue passenger origin and destination survey. For 
example, if a passenger travels from Ottawa to Montréal with Air Canada, then 
connects with Air France for Paris, the revenue passenger origin and 
destination survey will capture the trip because a Canadian carrier 
participated somewhere in the journey. The Canadian carrier would report the 


complete carrier and routing detail, including the Air France segment. 


The revenue passenger origin and destination survey, however, does not cover
coupons with foreign carriers on all legs of an itinerary. An example of such
an itinerary would be that of a passenger flying on Air France from Paris to
Montréal and then back to Paris on Air France. If this itinerary were the
passenger's total journey, this journey would not be reported to the revenue
passenger origin and destination survey. However, the itineraries of such
passengers are in the target population of international journeys.
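
The selection rule and the coverage gap described above can be sketched in
code. This is only an illustration under stated assumptions, not the carriers'
actual procedure: the set of participating carrier codes is a small
hypothetical subset, the serial numbers are invented, and the function names
are introduced for the example.

```python
# Hypothetical subset of participating carriers (AC = Air Canada, CP = CP Air).
PARTICIPATING = {"AC", "CP"}


def reporting_carrier(leg_carriers):
    """Return the first participating carrier on the ticket, or None."""
    for carrier in leg_carriers:
        if carrier in PARTICIPATING:
            return carrier
    return None


def is_reported(serial_number, leg_carriers):
    """A ticket is reported when its serial number ends in 0 and at least
    one participating carrier flies a leg of the itinerary."""
    ends_in_zero = serial_number.endswith("0")
    return ends_in_zero and reporting_carrier(leg_carriers) is not None


# Ottawa - Montréal on Air Canada, then Montréal - Paris on Air France (AF).
interlining = is_reported("0142357810", ["AC", "AF"])
# Paris - Montréal - Paris entirely on Air France: never reported.
non_interlining = is_reported("0142357820", ["AF", "AF"])
# Interlining ticket whose serial number does not end in 0: not sampled.
not_sampled = is_reported("0142357813", ["AC", "AF"])
```

The second case is the non-coverage case: the serial number qualifies, but no
participating carrier handles the ticket, so no carrier is in a position to
report it.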


This incomplete coverage of the target population is the non-coverage problem 
for the ASIPOD estimation system. The coverage problem seems to be 
"non-coverage" as opposed to "undercoverage", since it is not even possible to 


include a large portion of the universe in the frame. 


The existing system takes the revenue passenger origin and destination survey 
data and the airport activity survey data and applies a method called "the 


assignment technique" in order to produce total market estimates. 


The airport activity survey counts passengers, on a census basis by flight,
entering and leaving each Canadian airport. The survey covers all Canadian,
American and foreign scheduled carriers, but it does not consider the
passengers' initial origin or final destination.


Hence, the airport activity survey provides a count of the total volume of
passengers for all carriers in the target population. The assignment
technique is a method of estimating the non-coverage volume of passengers and
assigning it to origin-destination pairs. However, in order to explain the
assignment technique, a somewhat more thorough description of the two air
traffic surveys is required.



4. RELEVANT ASPECTS OF THE TWO SURVEYS


The authorizing agency for the two survey programs is the Air Transport
Committee of the Canadian Transport Commission, in co-operation with Transport
Canada. The data are collected from the air carriers on behalf of the Air
Transport Committee by the Aviation Statistics Centre (ASC) of Statistics
Canada. Under the authority of the Air Carrier Regulations of the Aeronautics
Act, reporting by the carriers on the ASC statements (i.e. questionnaires) is
compulsory.


The Revenue Passenger Origin and Destination (O & D) statistics are reported
to the ASC via Statement 35. The reported data items, among others, include:

- ticket origin and ticket destination
- points of intra- and interlining (i.e. routing)
- carrier on each flight coupon stage


The revenue passenger origin and destination data are submitted monthly by
major Canadian unit toll air carriers conducting scheduled passenger services.
Since January 1, 1982 the seven Canadian carriers contributing information to
this survey have been Air Canada, CP Air, Eastern Provincial Airways, Nordair,
Pacific Western Airlines, Air Ontario and Quebecair. The American data are
collected by the Civil Aeronautics Board from all certificated United States
air carriers, except helicopter operators and intra-Alaska carriers. The data
for the three months of each quarter are combined, and duplicates are
eliminated, so that a file of complete itineraries, Ticket Origin and
Destination (TOD) records, is obtained for the quarter.


However, passenger origin and destination statistics are compiled using the
Directional Origin and Destination (DOD) concept. The DOD concept can be
defined as "points of initial departure and ultimate destination named in the
sequence which indicates the direction of travel". DOD's are pieces of
itineraries which are broken up such that each component piece defines a
reasonably consistent direction. To create DOD's, "open-jaw" and return
itineraries, such as "symmetrical" and "circle" itineraries, must be broken
into pieces which are essentially one-way trips. To obtain the DOD's, the
TOD's are passed through the breakpoint routines. This breakpoint process is
automated within the Passenger Origin and Destination System, and involves the
calculation of various point-to-point distances within the itinerary and the
comparison of these distances to the total itinerary length. As a general
rule, itineraries are broken at the farthest point from the origin. Each DOD
formed is recycled through breakpoint routines until no further breakpoints
can be assigned.
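
The general rule stated above (break at the farthest point from the origin,
then recycle each piece) can be illustrated with a short sketch. It is not the
breakpoint routine of the Passenger Origin and Destination System: the
comparison with total itinerary length is left out, the stopping rule (no
break when the farthest point is the last point) is an assumption, and the
distances are rough illustrative values in kilometres.

```python
# Rough, illustrative distances in kilometres (assumed values).
DISTANCES = {
    frozenset(("YWG", "YYZ")): 1500,
    frozenset(("YWG", "LON")): 6300,
    frozenset(("YYZ", "LON")): 5700,
}


def distance(a, b):
    """Symmetric point-to-point distance; zero for identical points."""
    return 0 if a == b else DISTANCES[frozenset((a, b))]


def break_itinerary(points, dist=distance):
    """Split a list of points into pieces that are essentially one-way.

    Assumed rule: find the point farthest from the origin. If it is the
    last point the piece is already one-way; otherwise break there and
    recycle both pieces through the same routine.
    """
    if len(points) <= 2:
        return [points]
    origin = points[0]
    far = max(range(1, len(points)), key=lambda k: dist(origin, points[k]))
    if far == len(points) - 1:
        return [points]
    return break_itinerary(points[: far + 1], dist) + break_itinerary(points[far:], dist)


open_jaw = break_itinerary(["YWG", "LON", "YYZ"])
round_trip = break_itinerary(["YWG", "LON", "YWG"])
one_way = break_itinerary(["YWG", "YYZ", "LON"])
```

Under these assumptions the open-jaw itinerary Winnipeg - London - Toronto
yields the two pieces Winnipeg - London and London - Toronto, while the
one-way itinerary is left intact.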


The airport activity data are filed on Statement DZS The relevant items 


included for each flight are:

- the reporting carrier
- the reporting airport
- the point of origin and final scheduled destination of the flight
- the last station arrived from, for arrivals; or next station departed to,
for departures
- the number of deplaned or enplaned revenue passengers


The airport activity data are submitted monthly by Canada's transcontinental
(Air Canada and CP Air) and regional (Eastern Provincial, Quebecair, Nordair
and Pacific Western Airlines) air carriers, by Norcanair and by all
foreign carriers (including American carriers) operating scheduled
international flights into and out of Canada. Since January 1, 1982, there
have been 10 American and 21 other foreign carriers filing reports for each
Canadian airport they served. Each new foreign carrier granted a licence to
serve Canada on a scheduled basis is automatically included as a participant
in the airport activity data collection system.


From the airport activity data, the census traffic flow data are obtained. 
"Traffic flow" can be defined as a count, over a certain period of time, of 
the number of persons who are flying on a specific carrier between a Canadian 
reporting airport and an adjacent point. The adjacent point is called the 
next stop or the last stop. For the purposes of the assignment technique only 
the traffic data for foreign (non-U.S.) carriers are input into the system. 
The data elements extracted from this survey, and used to determine the O & D 
international markets, are the number of revenue passengers enplaned and 
deplaned in Canada, the Canadian gateway carrier, and the Canadian gateway. 
In this survey the concept of "Canadian gateway" is defined to be that
reporting airport at which a foreign (non-U.S.) carrier enters or leaves
Canada.


However, in the revenue passenger O & D survey, the Canadian gateway for 
Canadian and U.S. carriers is the first/last Canadian point in the itinerary 
for a flight entering/leaving Canada. For foreign carriers the Canadian 
gateway refers to the point inside Canada where the passenger enters or leaves 


the foreign carrier. Consider the following fictitious example: 


Assume that Air France flies Toronto - Montréal - Paris. Some passengers
enplaned in Toronto, and some enplaned in Montréal.


Assume also that the matching single crossing DOD's are:


Winnipeg - Air Canada - Toronto - Air France - Montréal - Air France - Paris

Toronto - Air France - Montréal - Air France - Paris - British Airways -
London


The Canadian gateway in the above example would be Toronto because Toronto is
the point at which the passengers enter the foreign carrier.



5. THE OBJECTIVES OF THE REVISION


The four main objectives, the fourth of which will be discussed in detail in
this paper, are as follows:


(i) to eliminate problems which have been identified in the existing system.


The existing system does not impute for illegible carrier codes on
flight coupons, so that "unknown carriers" becomes the third largest
carrier in tabulations. Also, there is no check on the coding which
indicates whether a carrier is flying to and from airports at which it
actually has landing rights. As a result of even existing tabulation
requirements, additional edits and imputations, to handle illegible or
incorrect data on international flight coupons, are required over and
above those required for domestic flight coupons alone.


Other nonsampling errors have been identified, but cannot be easily
corrected by an estimation system. Examples of such errors are
misinterpretation by participating carriers of instructions for
selecting flight coupons; systematic errors in serial numbers, used for
sample selection, on ticket stock; errors in the carriers' processing
systems, etc. The control of these nonsampling errors for which it is
not easy to correct is not an objective of this revision.


(ii) to develop a simple computing system which is easy to use, and produces
summary diagnostic information.


Some 625,000 origin and destination records, and some 820,000 airport
activity records must be processed annually with minimal manual
intervention. Since large volumes of passengers are dispersed across a
large number of origin-destination pairs, the diagnostics at each stage
of the system must summarize the processing, and still be able to point
out potential problems.


(iii) to tabulate the international passenger estimates, regularly and on an
ad hoc basis, in ways which will simplify the analyses undertaken by
users.


(iv) to produce quantifiably reliable estimates of the number of air
scheduled international passengers by origin and destination.


Although the reliability of these statistics has been thought to be
variable in the past, it has been, in fact, unknown to date. Estimating
the reliability of these data will improve the knowledge of the strength
of the inferences that can be made from analyses of these data.
Inferences made without knowledge of the reliability of the data could
actually be quite misleading.


6. SOLUTION OF THE NON-COVERAGE PROBLEM


6.1 Magnitude of the Problem


As in other surveys, non-coverage is a problem for the ASIPOD methodology,
since no sample data are available on the origin-destination patterns of the
non-coverage portion of the target population.

The following table gives an indication of the volume of passengers travelling
between Canada and nine world areas. (These data are 1979 annual estimates.
The world areas are not identified because these data are confidential.)


Table 1 - Estimated Number of International Passengers - 1979

Between Canada      Revenue Passenger      Non-coverage                Total
and                 Origin and             (percentage of Total
                    Destination            in brackets)

World Area #1             116,050              495  (0.4)            116,545
World Area #2             325,200            2,020  (0.6)            327,220
World Area #3              99,010           12,628 (11.3)            111,638
World Area #4              67,200           14,558 (17.8)             81,758
World Area #5             575,010          128,076 (18.2)            703,086
World Area #6             205,180          101,464 (33.1)            306,644
World Area #7           1,221,040          908,876 (42.7)          2,129,916
World Area #8              54,410           56,807 (51.1)            111,217
World Area #9              45,330           57,267 (55.8)            102,597
Total World             2,708,430        1,282,191 (32.1)          3,990,621


From the percentage non-coverage figures, it is evident that the non-coverage 


problem is a major concern. 
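
The bracketed percentages in Table 1 are the non-coverage volume divided by
the total for the world area. The sketch below recomputes three of them from
the table entries; the function and variable names are introduced only for
this illustration.

```python
# (revenue passenger O & D estimate, non-coverage volume) from Table 1.
rows = {
    "World Area #1": (116_050, 495),
    "World Area #3": (99_010, 12_628),
    "World Area #7": (1_221_040, 908_876),
}


def non_coverage_percent(sample_based, non_coverage):
    """Non-coverage as a percentage of the total market, to one decimal."""
    total = sample_based + non_coverage
    return round(100 * non_coverage / total, 1)


percentages = {area: non_coverage_percent(*counts) for area, counts in rows.items()}
```

For World Area #7 this gives 908,876 / 2,129,916, or 42.7 percent, which is
the figure quoted in the discussion of geographic aggregation.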


The same table as above, but between Eastern Canada and the same nine world
areas, would tend to have a higher percentage non-coverage for each world
area. Hence, a lower level of geographic aggregation in origin-destination
pairs generally implies a higher percentage non-coverage. For example, the
non-coverage for Eastern Canada to World Area #7 is 55 percent, compared to
the 42.7 percent tabulated for all of Canada to World Area #7 (as above). To
clarify the idea that a higher level of geographic aggregation in
origin-destination pairs generally implies a lower percentage non-coverage,
consider the fact that non-coverage exists only for foreign traffic
terminating at Canadian gateways. There is complete coverage of all traffic
for which the Canadian end of the origin-destination pair is not a Canadian
gateway. Hence, as the level of geographic aggregation becomes higher, more
and more interlining traffic is included, and the percentage non-coverage
becomes lower.


6.2 The Assignment Technique


The assignment technique estimates the non-coverage volume of passengers and 


then allocates this volume to origin-destination pairs. 


From the airport activity survey, b benchmark counts of passengers entering
and leaving Canada at Canadian gateway airports are tabulated by carrier. The
value of b, the number of assignment groupings, can be determined as follows:

$$b = 2 \sum_{i=1}^{g} n_i$$

where the '2' accounts for the fact that there is one count each for
passengers entering and leaving Canada,

g is the number of Canadian gateway airports, and

$n_i$ is the number of foreign (non-U.S.) carriers with landing rights
at the ith Canadian gateway airport.
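
As a small illustration of this count, the sketch below enumerates the
assignment groupings for an invented set of landing rights. The gateways,
carrier codes and function name are hypothetical; only the rule (one inbound
and one outbound count per carrier per gateway) comes from the text.

```python
# Hypothetical landing rights: gateway airport -> foreign (non-U.S.) carriers.
landing_rights = {
    "YMX": ["AF", "BA", "LH"],
    "YYZ": ["BA", "LO"],
}


def number_of_groupings(rights):
    """b = 2 * (sum over gateways of the number of carriers with rights)."""
    return 2 * sum(len(carriers) for carriers in rights.values())


# One grouping per (gateway, carrier, direction) combination.
groupings = [
    (gateway, carrier, direction)
    for gateway, carriers in landing_rights.items()
    for carrier in carriers
    for direction in ("inbound", "outbound")
]

b = number_of_groupings(landing_rights)
```

With three carriers at one gateway and two at the other, b = 2 x (3 + 2) = 10,
which matches the number of enumerated groupings.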


From the revenue passenger origin and destination survey, a corresponding 
number of inbound and outbound passengers on international DOD's can be 
tabulated by crossing carrier and Canadian gateway airport. Hence, there are 


also b such counts from the sample survey data. 


In the first stage of the assignment technique the non-coverage volume, $A_i$,
of passengers in assignment grouping i can be estimated as follows:

$$A_i = C_i - (1/f_i) \times D_i \qquad (i = 1, \ldots, b)$$

where $C_i$ is the airport activity census count in assignment grouping i,

$f_i$ is the revenue passenger O & D survey sampling fraction (i.e. 0.1), and

$D_i$ is the sample number of international passengers in assignment
grouping i.


$A_i$ is, then, an estimate of the number of passengers, carried on
the foreign carrier in the ith of b assignment groupings, for
which there is no origin-destination information.
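
The first stage is a single subtraction per assignment grouping. A minimal
sketch follows, using exact fractions to avoid rounding in 1/f; the function
name is an assumption made for the example, and no special handling is shown
for groupings where the weighted sample exceeds the census count.

```python
from fractions import Fraction


def non_coverage_volume(census_count, sample_count, f=Fraction(1, 10)):
    """First stage: A_i = C_i - (1/f_i) * D_i for one assignment grouping."""
    return census_count - sample_count / f


# A grouping with a census count of 120 and 3 sampled passengers.
A_i = non_coverage_volume(120, 3)
```

Here the 3 sampled passengers represent 30 interlining passengers, leaving 90
passengers for whom no origin-destination information exists.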


The next stage allocates the non-coverage volume to origin-destination pairs. 
Such pairs in the non-coverage portion are called non-interlining DOD's, since 
they are DOD's, flown by foreign carriers, which do not interline with a 
participating Canadian carrier. The assignment technique imputes non- 


interlining DOD's in the ith assignment grouping as follows: 


(i) All of the DOD's contributing passengers to $D_i$ are
identified. (These would be DOD's which match on Canadian gateway,
foreign carrier and direction.)


(ii) The domestic portion of such DOD's (i.e. the portion from the
Canadian point to the Canadian gateway city) is eliminated. (The
domestic portion of such DOD's would be on a Canadian carrier, and,
therefore, would be picked up in the revenue passenger O & D survey.
The resultant "truncated DOD's" are, then, non-interlining.)


(iii) The non-coverage volume, i.e. $A_i$, is assigned to the resultant
"sample DOD's" in proportion to their original contribution to $D_i$.
New DOD records consisting of "assigned passengers" are produced.


The assignment technique assumes that the truncated DOD's are representative
of the non-interlining traffic. As a result, some original sample DOD's
are receiving more weight than they would in the revenue passenger O & D
survey alone. Hence, the estimator, $\hat{d}_j$, for the total market number of
passengers for the jth origin-destination pair can be derived by adjusting
the sampling fraction as follows:

$$\hat{d}_j = (1/f'_j) \times d_j$$

$$\text{where} \quad f'_j = \frac{d_j}{(d_j/f) + a_j} \qquad (1)$$

and where $f'_j$ is the adjusted sampling fraction associated with the jth
origin-destination pair, from the revenue passenger O & D
sample survey,

f is the original sampling fraction of the revenue passenger O & D
sample survey (i.e. 0.1),

$d_j$ is the sample number of international passengers in the
jth origin-destination pair, from the revenue passenger
O & D sample survey, and

$a_j$ is the number of passengers assigned to the jth
origin-destination pair.


Note that $a_j$, according to point (iii) above, is

$$a_j = \frac{d_j}{D_i} \times A_i , \qquad j \in i ,$$

$$\text{where} \quad \sum_{j \in i} d_j = D_i \quad \text{and} \quad \sum_{j \in i} a_j = A_i .$$

6.3 An Example of the Assignment Technique 


A simple example will illustrate how the assignment technique works. Assume
that the airport activity data from British Airways indicated that 120
passengers enplaned at Montréal and went to London. Therefore, $C_i = 120$.
Assume that the only two DOD's for this assignment grouping from the revenue
passenger origin and destination survey are:


DOD's                        Sample Number     Estimate of the Number
                             of Passengers     of Passengers

YWG-AC-YMX-BA-LON-LO-WAW           1                   10
YYZ-CP-YMX-BA-LON-LH-HAM           2                   20


where the codes are to be interpreted as:


Code     Denotes

YWG      Winnipeg
YMX      Montréal (Mirabel)
LON      London
WAW      Warsaw
YYZ      Toronto
HAM      Hamburg
AC       Air Canada
BA       British Airways
LO       LOT
CP       CP Air
LH       Lufthansa


Therefore, $D_i = 3$, $f_i = 0.1$, and $A_i = 120 - (1/0.1) \times 3 = 90$.


The truncated DOD's, the proportion of their contribution to $D_i$, the
resultant number of assigned passengers and total market estimates are then:


Truncated DOD's              Proportions     Assigned             Total Market
                                             Passengers ($a_j$)   Estimates

YMX - BA - LON - LO - WAW        1/3               30                  40
YMX - BA - LON - LH - HAM        2/3               60                  80
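
The arithmetic of this example can be reproduced with a short sketch of steps
(i) to (iii). It is an illustration under stated assumptions rather than the
ASIPOD implementation: DOD's are represented as hyphen-separated strings, the
function names are invented for the example, and the total market figure is
computed as the weighted sample estimate plus the assigned passengers.

```python
from fractions import Fraction


def truncate(dod, gateway):
    """Drop the domestic portion: keep the DOD from the gateway onward."""
    tokens = dod.split("-")
    return "-".join(tokens[tokens.index(gateway):])


def assign(sample_counts, census_count, gateway, f=Fraction(1, 10)):
    """Return {truncated DOD: (assigned passengers, total market estimate)}."""
    D_i = sum(sample_counts.values())
    A_i = census_count - D_i / f          # first stage: non-coverage volume
    result = {}
    for dod, d_j in sample_counts.items():
        a_j = Fraction(d_j, D_i) * A_i    # proportional allocation
        key = truncate(dod, gateway)
        prev_a, prev_total = result.get(key, (0, 0))
        result[key] = (prev_a + a_j, prev_total + d_j / f + a_j)
    return result


sample = {
    "YWG-AC-YMX-BA-LON-LO-WAW": 1,
    "YYZ-CP-YMX-BA-LON-LH-HAM": 2,
}
allocation = assign(sample, census_count=120, gateway="YMX")
total_market = sum(total for _, total in allocation.values())
```

The two total market figures sum to 120, the airport activity census count
for the grouping, as expected from the construction of the technique.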


6.4 Shortcomings of the Assignment Technique 


The assignment technique has some recognized shortcomings. 


The basic assumption of the assignment technique is that the truncated DOD's
are representative of the non-interlining traffic. Consider the hypothetical
example above in order to determine whether this assumption is intuitively
reasonable. The assignment technique presumes that passengers flying on a
particular foreign air carrier and originating in Montréal would have the same
ultimate destination as passengers originating in Winnipeg or Toronto who fly
through Montréal. Hence, the travel patterns of ethnic communities, for
example the Polish and German communities in Toronto and Winnipeg
respectively, might be used to impute for travel patterns of the more
predominantly French communities in Montréal. And, in fact, the assumption
that interlining and non-interlining travel patterns are the same was proven
empirically to be suspect in a pilot test (Rosen and Conroy (1977)) conducted
by the Canadian Transport Commission in 1977. Therefore, there is not only
some intuitive but also some empirical evidence against the basic assumption
of the assignment technique.


The accuracy of the estimates of the number of passengers by
origin-destination pair is jeopardized by any violations of the assumption
that truncated DOD's are representative of non-interlining traffic. Large
volumes of passengers are allocated to origin-destination pairs, as was seen
in Table 1 above, based on a small "effective" sample size. For example, the
"effective" sampling fraction of the Canada - World Area #7 market is not 10%,
as in the domestic survey, but 5.7% (i.e. (100% - 42.7%) x 10%), because of
the non-coverage of non-interlining traffic. Hence, a smaller than 10% sample
of DOD's is used to allocate a large volume of passengers. Violations of the
aforementioned assumption, then, would cause a potentially large bias in the
estimates.
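
The "effective" sampling fraction quoted above is the nominal 10% fraction
scaled by the covered share of the market. A minimal sketch follows, with a
function name introduced only for this example:

```python
from fractions import Fraction


def effective_sampling_fraction(non_coverage_share, f=Fraction(1, 10)):
    """Nominal sampling fraction times the covered share of the market."""
    return (1 - non_coverage_share) * f


# Canada - World Area #7: 42.7 percent non-coverage.
effective = effective_sampling_fraction(Fraction(427, 1000))
percent = round(float(effective) * 100, 1)
```

For the Canada - World Area #7 market this gives 0.573 x 10% = 5.73%, which
rounds to the 5.7% cited in the text.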


Since airport activity census counts are used as benchmark figures for traffic
volumes, their accuracy is very important. Although no evaluations have been
undertaken to investigate the magnitude of bias from nonsampling errors in the
airport activity census counts, aviation statistics economists feel that this
is a survey in which such errors would be small. There are currently, however,
ongoing discussions with the major Canadian air carriers on how the reporting
requirements of government agencies can be minimized. As a result of these
discussions, the airport activity survey could become a sample survey. If a
sample is to be designed, the accuracy of the activity counts of gateway
airport passengers would be an important design consideration.


The assignment technique also assumes that all of the non-coverage is
accounted for by the non-interlining traffic. An interesting way of
validating this assumption would be to compare, for the same reference period,
the airport activity census count for a Canadian air carrier to the analogous
estimate from the origin-destination sample survey. This analogous estimate
would be the sum of all passengers on DOD's with the same Canadian gateway
airport and crossing carrier. It would be necessary to be able to determine
whether differences in the estimates were ascribable to differences in the
concepts of the two surveys and, if so, whether these differences have
been taken into account in the ASIPOD estimation system. If such differences
have not been accounted for, then it could be that there is a problem with the
assumption that all of the non-coverage is accounted for by non-interlining
traffic.


Many of these shortcomings should be investigated. However, there are major
problems in the existing system, and no alternative solutions which are
superior to the same basic methodology of the assignment technique, and which
can be implemented within time and cost constraints, have been found.
Furthermore, the improvements in the estimates of the number of passengers by
air traffic market in the new ASIPOD system will be substantial.


The use of the assignment technique to estimate for international markets is
an innovative solution to a large problem. It does not completely solve the
non-coverage problem, but it is a major step in the right direction, as will
be the production of estimates of the variance of the estimates. These
variance estimates should take account of the assignment technique and its
assumptions, and, at the same time, give a meaningful measure of the
reliability of the DOD estimates.


7. ESTIMATION OF VARIANCE 


The estimator of the variance of the international origin-destination 
estimates is a simple extension of the variance estimator for the revenue 


passenger origin and destination survey. 


7.1 Variance of the Estimate of Interlining Traffic 


The development of the estimator of the variance of the revenue passenger
origin-destination estimates is dependent upon the way in which tickets
contribute passengers to origin-destination pairs (i.e. the domains of
interest). Recall that each ticket is selected with probability 0.1, and that
each ticket may be broken up into several segments or DOD's. Each ticket may
contribute 0, 1, 2, etc. passengers to a given domain of interest. For example,
the itinerary


YWG - AC - LON - BA - YYZ


would be broken, via the breakpoint routines, into the two DOD's


YWG - AC - LON
and
LON - BA - YYZ.


Consider the inbound plus outbound estimates which are total passenger
figures, independent of direction. For such estimates this ticket would add
passengers to, among others, the following domains of interest:


Domains of Interest          Passenger Count

Winnipeg - London                   1
Toronto - London                    1
Canada - London                     2
Canada - Europe                     2
Eastern Canada - Europe             1
Western Canada - Europe             1

Note that the number of passengers per ticket depends on the geographic level 
of aggregation, and, therefore, on the particular origin-destination estimate 


(i.e. domain of interest). 
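
The counts in the table above can be reproduced by tallying, for each domain
of interest, the DOD's on the ticket whose endpoints fall in the domain in
either direction. The area lookup and the function name below are assumptions
made for the illustration.

```python
# Hypothetical lookup: point -> geographic areas containing it.
AREAS = {
    "YWG": {"Winnipeg", "Western Canada", "Canada"},
    "YYZ": {"Toronto", "Eastern Canada", "Canada"},
    "LON": {"London", "Europe"},
}


def passengers_in_domain(dods, area_a, area_b):
    """Count the DOD's on one ticket that join area_a and area_b,
    regardless of direction (inbound plus outbound)."""
    count = 0
    for start, end in dods:
        forward = area_a in AREAS[start] and area_b in AREAS[end]
        backward = area_b in AREAS[start] and area_a in AREAS[end]
        if forward or backward:
            count += 1
    return count


# The two DOD's formed from the itinerary YWG - AC - LON - BA - YYZ.
ticket = [("YWG", "LON"), ("LON", "YYZ")]
counts = {
    (a, b): passengers_in_domain(ticket, a, b)
    for a, b in [
        ("Winnipeg", "London"),
        ("Toronto", "London"),
        ("Canada", "London"),
        ("Canada", "Europe"),
        ("Eastern Canada", "Europe"),
        ("Western Canada", "Europe"),
    ]
}
```

The same ticket therefore contributes one passenger to Winnipeg - London but
two passengers to Canada - Europe.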


The estimate, $\hat{d}$, of the number of passengers from the revenue passenger
origin and destination survey can be developed for a particular domain of
interest as follows:

$$\hat{d} = \frac{N}{n} \sum_{i=1}^{n} x_i = \frac{1}{f} \sum_{i=1}^{n} x_i$$

where $x_i$ is the number of DOD's belonging to the domain of interest on the
ith ticket,

n is the number of sample tickets selected,

N is the population number of tickets in the revenue passenger
origin and destination survey, and

f is the sampling fraction, i.e. n/N = .1


The revenue passenger origin and destination survey sample is effectively a
10% simple random sample because

(i) the selection of coupons with serial numbers ending in the digit "0"
produces a systematic sample, and

(ii) there is no cycle associated with the distribution of tickets which
would cause a relationship between the survey estimates and the last
digit of the serial number.


The estimate of the variance can be written, then, as 


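
    \[ \mathrm{var}(d) = N^2 \, \frac{(1-f)}{n} \, s^2 \]

where

    \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
       \quad \text{and} \quad
       \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i , \]

and the coefficient of variation, as

    \[ \mathrm{cv}(d) = \sqrt{\mathrm{var}(d)} \, / \, d . \]

Note that var(d) can also be written as

    \[ \mathrm{var}(d) = \frac{(1-f)}{f^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (2) \]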


where n is assumed to be sufficiently large for n/(n-1) to be approximately 


equal to 1. 
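
As a minimal sketch of the calculation, assuming simple random sampling of tickets at fraction f, the estimate, its variance and its coefficient of variation can be computed as below. The function name expansion_estimate and the sample data are hypothetical and serve only to illustrate the formulas.

```python
import math

def expansion_estimate(x, f=0.1):
    """Return (d, var_d, cv_d) for per-ticket domain counts x sampled at fraction f."""
    n = len(x)
    N = n / f                                   # implied population number of tickets
    mean = sum(x) / n
    s2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    d = N * mean                                # equivalent to sum(x) / f
    var_d = N ** 2 * (1 - f) * s2 / n
    cv_d = math.sqrt(var_d) / d
    return d, var_d, cv_d

# Illustrative data only: 1000 sampled tickets, of which 30 contribute one
# passenger to the domain of interest and 10 contribute two.
x = [0] * 960 + [1] * 30 + [2] * 10
d, var_d, cv_d = expansion_estimate(x)
```

With n = 1000 the factor n/(n-1) is about 1.001, so the exact form and the approximate form of var(d) differ by roughly one tenth of one percent for these data.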


7.2 Variance of Total Market Estimates 


The method for calculating total market coefficients of variation has to 
recognize that the assigned data are a function of the sample data. In other 
words different samples will produce different assigned data and, thereby, 


different values of the total market estimate. The re-use of certain portions 
of the sample has an effect on the sampling distribution of the estimates.


The method which will be used adjusts the sampling fraction from 10% to be the 
percentage for which sampled records for a particular domain of interest are 
actually accounting. Since the use of the assignment technique is a given, it 
has to be assumed that truncated DOD's from the revenue passenger origin and 
destination survey are representative of the non-interlining traffic. The 
measure of reliability will be a measure of the precision of the DOD estima- 
tes, only to the extent to which this assumption is valid. As a result, it is 
reasonable to adjust the weights of the sample DOD's to take into account the 


non-interlining DOD's. The sampling fraction, then, for the estimation for the 


jth domain of interest would be Fj as developed in equation (1) above. Fj 


would replace f in the formula for var(d) in equation (2) above in order to 


yield the formula for the variance of the total market estimates: 

    \[ \mathrm{var}(d_j) = \frac{(1-F_j)}{F_j^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 . \]

And, the coefficient of variation for the total market estimate would be

    \[ \mathrm{cv}(d_j) = \sqrt{\mathrm{var}(d_j)} \, / \, d_j . \]


Note that Fj < .10 whenever the amount of assigned data is greater than 0. This means that less than a 10% sample is
achieved when it is necessary to include assigned data in a total market 
estimate. Hence, by using Fj instead of .10 in the expression for the 
variance of a DOD estimate, the coefficient of variation is adjusted to relate 


to the total market estimate. 
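
The effect of replacing f by Fj can be sketched as follows. This is an illustration only: the values 0.08 and 0.05 for Fj are hypothetical, the per-ticket counts are invented, and the total market estimate is assumed here to be the sample total weighted by 1/Fj, which the text suggests but does not state explicitly.

```python
import math

def approx_variance(x, fraction):
    """Equation (2) with a general sampling fraction in place of f."""
    mean = sum(x) / len(x)
    ss = sum((xi - mean) ** 2 for xi in x)
    return (1 - fraction) / fraction ** 2 * ss

def total_market_cv(x, F_j, d_j):
    """cv(d_j) = sqrt(var(d_j)) / d_j, with F_j replacing f."""
    return math.sqrt(approx_variance(x, F_j)) / d_j

x = [0] * 960 + [1] * 30 + [2] * 10   # illustrative per-ticket counts only
results = {}
for F_j in (0.10, 0.08, 0.05):        # 0.08 and 0.05 are hypothetical values
    d_j = sum(x) / F_j                # ASSUMPTION: estimate re-weighted by 1 / F_j
    results[F_j] = (approx_variance(x, F_j), total_market_cv(x, F_j, d_j))
```

Under these assumptions both the variance and the coefficient of variation increase as Fj falls below .10, reflecting the smaller effective sample behind a total market estimate.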


This method gives credit to the use of sample data in the assignment 
technique; but it is dependent, as is the determination of the estimates 
themselves, upon the assumption that truncated DOD's are representative of 


non-interlining traffic. 


8. FUTURE CONSIDERATIONS


Earlier, the need for data on air traffic markets in international bilateral 
air negotiations was explained. The exchange of statistical information is 
one of the negotiable articles in air transport agreements. Currently, 
agreements for the exchange of statistical information exist with several 
countries. The concepts and quality of the data from some of these countries 
indicate that these data could be used in the ASIPOD estimation system. Such 
data would provide sample information on the non-coverage portion of the 
international universe of tickets. Feasibility work is currently underway to 
determine whether the number of countries for which these data can be used 
would improve the accuracy of a sufficient number of estimates to justify the 


expansion of the ASIPOD system to use "exchange data". 


SOME ASPECTS OF QUALITY OF
CANCER MORTALITY AND INCIDENCE STATISTICS


D. Binder and A. Malhotra¹


Statistics Canada, Canada's central statistical agency, has been 
compiling national mortality statistics, including those on cancer 
mortality since 1921. Also, cancer incidence data are available 
from 1969. 


The data quality of these files may be assessed in a variety of 
ways. Ratios of cancer mortality to incidence give some informa- 
tion on coverage errors. Micro-data matches between incidence and 
mortality files give an indication of misclassifications. As 
well, multiple registrations for cancer incidence may be duplicates.
Completeness and availability of data items are also important for
special studies.


In this paper, the feasibility of using these measures of data 
quality and the implications of these measures are discussed. 


1. INTRODUCTION 


Population based cancer statistics are the basis of epidemiological research 
into the distribution and determinants of cancer and underlie health program- 
mes for the prevention, diagnosis and treatment of cancer. Statistics Canada, 
Canada's central statistical agency, compiles two such types of data on can- 


cer. 


1. National mortality data which are based on reports from pro- 
vincial vital statistics registration systems. These data 
date back to 1921. 


2. National cancer incidence data which are based on notifica- 
tions from provincial cancer registries. This data series was 


established in 1969. 


1 D. Binder, Institutional and Agriculture Survey Methods Division,
Statistics Canada and A. Malhotra, Health Division, Statistics Canada. 


Good statistics which provide reliable information on risk differentials
depend on completeness and accuracy of cancer registration and comparability 
of the data between different registration areas and time periods. Cancer 
incidence and mortality data as reflections of the true cancer risk each have 


their own merits and limitations. 


Cancer incidence data are particularly suitable for the epidemiological study
of cancer because they provide information on all cancers, not only those that
are fatal, because they can provide an early warning of emerging problems and
because the diagnostic information is usually detailed and of a high quality.
For example, the publication Cancer Incidence in Five Continents [1]
emphasizes the role that international comparisons of cancer incidence play in
yielding clues about the causes of cancer in spite of certain well known
limitations of the data. These limitations include difficulty in achieving
complete registration of new cancer cases and differences between registries
in the extent to which this is achieved. Major factors influencing coverage
are the number and types of data sources used, how active case finding is, the
length of time a registry has been in operation and whether or not the
reporting of cancer is a legal requirement in the registration area. Canadian
cancer registries are quite heterogeneous in their data collection methodology
but all attempt to follow international [2] as well as national [3] guidelines
for the standardized recording of cancer incidence data. Differences in
sources and techniques of registration not only influence coverage but also
other aspects of data quality such as detail of socio-demographic and
geographic information that is provided. Also, cancer incidence data are
sensitive to such factors as mass screening programmes which result in the
inclusion of previously undiagnosed prevalent cases.


The editors of Cancer Incidence in Five Continents use and discuss a number of
indices which may be useful in assessing completeness of registration and
reliability of the data [4]. These include cancer mortality-incidence ratios
as indicators of completeness of registration (see Sections 2.1 and 3.1 of
this paper).


The reporting of deaths is a legal requirement in most developed countries so
that coverage error is assumed to be small. Known and suspected limitations
of mortality data for purposes of epidemiologic cancer research include a lack
of information on non-fatal cancers, less precise diagnostic information and
more frequent misclassification due to assignment and coding of underlying
cause of death, resulting in less accuracy compared with the diagnostic
information in cancer incidence statistics.


A quality assessment of vital statistics [5] which was undertaken as a pilot 
study gives some indication of the quality of Canadian mortality data (all 
causes), particularly on error rates in the coding of underlying cause of 
death. This error rate was 7.2% in the data year 1976. About two-thirds of 
the errors involved the first or second digit of the 4-digit cause of death 
code. Variation in the error rate by specific causes of death was not inves- 


tigated. 


This paper is concerned with assessing the feasibility of measuring certain
aspects of quality of the two cancer data files at Statistics Canada which are
used in epidemiological studies, namely the cancer incidence file and the
subset of records from the mortality file with an underlying cause of death of
cancer.


The aspects of data quality selected for investigation were: 


1. Completeness of registration of new cancer cases through a comparison of 
cancer mortality with cancer incidence in the same period. This compa- 
rison is a crude but commonly used indicator of completeness of registra- 


tion. 


2. Consistency of assignment of diagnosis and cause of death codes 


through matching individual records on the two files. 


3. Availability and completeness of data items through an analysis of 


how often valid values are present on the files. 


4. Registration of multiple primary cancers on the incidence file


through a matching of records within the file. 


The period covered by the study is 1969-1978, the period for which cancer
incidence data are available. Ontario was excluded from all investigations 
because the National Cancer Incidence Reporting system includes data for this 


province for 1969-1971 only. 


A discussion of the approach taken in the investigations is contained in 


Section 2 of this report. In Section 3 the findings are discussed. 


2. DESCRIPTION OF MEASURES


In this section we describe some methods for studying the data quality of the 


cancer mortality and incidence files. 


2.1 Mortality-Incidence Ratios


In order to study relative rates of undercoverage among cancer incidence 
registrations, we consider the ratios of deaths to incidents of cancer. Since 
the Mortality System registers all deaths in Canada by cause of death and the 
concept of the National Cancer Incidence Reporting System (NCIR) is to regis- 
ter all new incidents of primary malignant neoplasms, if the two registra- 
tion systems were of the same quality for all reporting registries, one would 
expect that the ratio of mortality rate to the incidence rate for a particular 
site would be fairly consistent within a population of given age and sex over 
sufficiently long periods of time. (We compute these ratios for deaths and 
incidents over 5-year and 10-year periods to reduce the effect of the time lag 
between the reporting of a cancer incident and death). Inconsistency of this 
ratio would arise if any of the following rates differ across reporting regis- 
tries: 

(a) rates of survival, or sudden increases in the rate of incidence,

(b) mortality rates from other competing risks, 


(c) rates of error in coding of underlying cause of death, 


2 Ontario developed a passive registration system which makes use of reports 
on cancer patients made for other purposes. Data for recent years are 
currently being prepared by the province. 
3 By exception, metastatic cancers are registered when the primary site is
unknown.


(d) rates of error in classifying cancer site for new cancers,


(e) rates of under-reporting or over-reporting of cancer incidents. 


If the mortality-incidence ratios are grossly different for most sites, then 
differing rates in (a) and (b) can be discounted. With respect to the error 
rates in (c), studies on the coding error for underlying cause of death have 
yielded error rates of less than 10% [5]. In an unpublished report it was 
found that these error rates could vary from 3% to 18% across reporting
registries. However, the observed differences in the mortality-incidence 
ratios (see Tables 1 and 2) cannot be completely explained by these error 
rates. Also, since about 90% of the registrations of new cancers are confir- 
med histologically, the error rate in (d) would be small. Therefore, these 
mortality-incidence ratios do give an indication of the coverage error in the 
NCIR System. 


Concentrating on those sites with leading diagnosis count (excluding skin can- 
cer), based on the NCIR file, for each sex, we show in Tables 1 and 2 the 
national mortality-incidence ratios as well as provinces with largest and 
smallest ratios, for two five-year periods, broken down by age-groups. We 
omit Prince Edward Island from consideration because the number of observed 


events is too small for valid comparison. 
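
The comparison described here can be sketched as follows for a single site, sex and age-group cell. The registry labels and counts are hypothetical and are not taken from Tables 1 and 2; the function name is likewise only illustrative.

```python
def mortality_incidence_ratios(deaths, incidents):
    """Ratio of cancer deaths to new registrations, per registry, for one
    site / sex / age-group cell. Registries with no registrations are skipped."""
    return {reg: deaths[reg] / incidents[reg] for reg in deaths if incidents.get(reg)}

# Hypothetical counts for three registries over one five-year period.
deaths = {"A": 420, "B": 390, "C": 450}
incidents = {"A": 600, "B": 610, "C": 500}

ratios = mortality_incidence_ratios(deaths, incidents)
national = sum(deaths.values()) / sum(incidents.values())
highest = max(ratios, key=ratios.get)   # registry with the largest ratio
lowest = min(ratios, key=ratios.get)    # registry with the smallest ratio
```

In this invented example registry C has a ratio well above the national figure, which under the argument above would point to less complete registration of new cases there.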


2.2 Matching Mortality and Incidence Records 


In order to assess the feasibility of evaluating errors in cause of death 
classification, or errors in cancer site classification, a sample of records
can be selected from either the Mortality File or Cancer Incidence File and 
then the other file can be searched for matching records. The manual search 
does not guarantee that all true matches will be found, and, in fact, the rate 
of successfully matching may be different across reporting registries, because 
the level of detail of matching variables can vary from one registry to 


another. (See Section 2.3 for a study on availability of data). 


Records with malignant neoplasms of the lung or bronchus (ICDA-8* is 162.1)
for the years 1969-1978 were selected as starting points from both files.
This choice was based on considerations of high incidence, high mortality and
short survival times so that conditions were favourable for finding matching
records on both files. In spite of this, because of the time difference
between diagnosis and death, it is true that cancer deaths in the earlier 
years and newly diagnosed cancers in the later years are less likely to have a 
corresponding record on the matched file. The analysis of the results would 
be improved in future studies if the sample design controlled for year of 


death and year of diagnosis. 


Two independent samples were selected: one from the Mortality File and the other
from the NCIR file. 


2.2.1 Mortality to Incidence 


All deaths from cancer between 1969 and 1978 should have, at least
conceptually, a corresponding record on the NCIR system. Noteworthy exceptions
to this rule are that the cancer was first diagnosed in Ontario or outside
Canada or that the cancer was first diagnosed before 1969. Besides the
exceptions, a lack of a corresponding record on the NCIR system is an
indication of undercoverage. This, of course, assumes that the underlying
cause of death in the Mortality File is error-free.


Therefore, if a sample of deaths from cancer is selected and matched to the
NCIR system, we have a number of possible outcomes: 

(a) no matching record is found, 

(b) a match was found with a record having a different cancer site, 


(c) a match was found with a record having the same cancer site. 


If no match is found, this is an indication of under-coverage, or that the 
cancer was first diagnosed in Ontario or outside Canada or prior to 1969, or 


that the death was not really a death due to cancer. Alternatively, as 


* International Classification of Diseases, Adapted for Use in the United
States, Eighth Revision.


previously mentioned, the matching process itself is not perfect. If a match
is found, but the records have different cancer sites, this may be an indica- 
tion of an error on either the Mortality File or the NCIR System. As men- 
tioned previously, it is generally believed that the NCIR system yields the 
more accurate disease classification, because of the high rate of histological 


confirmation. 


A small scale study was undertaken to measure this phenomenon. A random sam- 
ple of 56 records with underlying cause of death reported as a malignant neo- 
plasm of the lung or bronchus (ICDA-8 is 162.1) was selected from each of the 
provinces except Ontario yielding a total sample of 504 records. Only deaths 
between 1969 and 1978 were selected. 


The national rate of successful matches was 82.3%. The rates varied between 
73.2% and 96.4% across the nine provinces. Among those with successful 
matches, 92.5% had the same 4-digit ICDA-8 classification. These rates varied 
from 74.4% to 100.0%. For those provinces with the lowest and highest rates 
of matches with the same ICDA-8 classification, we give in Table 3 the break- 


down of the observed disease classifications. 
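
The outcome classification and the two national rates can be checked with a short sketch. The function names are hypothetical. The figures of 504 sampled deaths and 415 matches are those reported in this paper (the count of 415 is given in Section 3.2.1); the count of 384 records with the same classification is not quoted in the paper and is inferred here from the 92.5% agreement rate.

```python
def classify_outcome(death_site, incidence_site):
    """Outcome (a), (b) or (c) for one sampled death record.
    incidence_site is None when no matching NCIR record was found."""
    if incidence_site is None:
        return "a"                      # no matching record is found
    return "c" if incidence_site == death_site else "b"

def match_rates(n_sampled, n_matched, n_same_site):
    """Return (match rate, agreement rate among matches) as percentages."""
    return 100 * n_matched / n_sampled, 100 * n_same_site / n_matched

# 384 is an inferred count, not a figure reported in the paper.
match_rate, agreement_rate = match_rates(504, 415, 384)
```

Rounded to one decimal place these reproduce the reported 82.3% and 92.5%.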


2.2.2 Incidence to Mortality


A sample of records from the NCIR System was also selected and matched to the 
Mortality File. Fifty-six records with malignant neoplasms of the lung and 
bronchus (ICDA-8 is 162.1) from each of the nine reporting registries were 
randomly selected yielding a total sample of 504 records. The matched records 
were then checked for underlying cause of death on the complete Mortality File 
for 1969 to 1978. We did not check the cause of death on the original death 
certificate for this study, although this would be feasible for future 


studies. 


The outcomes from this manual match may be classified as follows: 

(a) no matching record is found, 

(b) a match was found with a cause of death other than cancer, 

(c) a match was found with a cause of death being cancer but not cancer of the 


lung or bronchus, 
(d) a match was found with the same cause of death.


If no match is found, this is an indication of the inadequacy of the matching
process, unless death occurred after 1978 or outside Canada or the person is
still alive.


A match found with a different underlying cause of death is an indication of 

one of the following: 

(a) a competing risk took precedence,

(b) cancer was a contributing cause of death but not the underlying cause of
death, or

(c) the underlying cause of death on the Mortality Data Base was incorrect, or
the cancer site was incorrect on the NCIR system (the latter being assumed
less likely).


The average rate of successful matches was 69.4%. The rates varied between 
55.4% and 80.4% across the nine provinces. Among those with successful 
matches, 92.0% had the same 4-digit ICDA-8 classification. These rates varied 
from 85.3% to 100.0%. For those provinces with the lowest and highest rates 
of matches with the same ICDA-8 classification, we give in Table 4 the break- 


down of the observed cause of death classifications. 


2.3 Availability and Completeness of the Data


One simple measure of the quality of the data files is the relative frequency 
of valid data for specific items. For the National Cancer Incidence Reporting 
System and the cancer deaths on the Mortality File, we concentrate on the fol- 
lowing items: 

- date of birth (day, month, year) 

- age 

- place of birth 


- county and subdivision of residence 


We chose these items to exemplify how easy or difficult it would be to match 
records from other files (e.g. Section 2.2), or to create special tabulations,


such as small area statistics. For each item we classify the data as being 
valid or invalid. Besides blank values, invalid data would arise when
alphabetic characters are found in a numeric variable or the numeric value is out
of range. We have aggregated the relative frequencies into two five-year 
groupings (1969-1973 and 1974-1978) so that we can see whether the quality has 


changed significantly in the later years. 
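
The validity rule described above can be sketched for a numeric item as follows. The function names and the sample values are hypothetical, and the range of 1 to 12 for month of birth is an assumption made for illustration; the actual edit rules applied to the files are not given in this paper.

```python
def is_valid_numeric(value, low, high):
    """True if value is non-blank, consists only of digits and lies in [low, high]."""
    text = str(value).strip() if value is not None else ""
    if not text or not text.isdecimal():
        return False                    # blank, or alphabetic/other characters present
    return low <= int(text) <= high     # out-of-range values are invalid

def percent_valid(values, low, high):
    """Relative frequency (in %) of valid values for one data item."""
    return 100 * sum(is_valid_numeric(v, low, high) for v in values) / len(values)

# Hypothetical month-of-birth field: one blank, one alphabetic, one out of range.
months = ["03", "12", "", "1A", "13", "7"]
rate = percent_valid(months, 1, 12)
```

Computing this rate separately for 1969-1973 and 1974-1978, and by province, would give figures of the kind reported in Tables 5 and 6.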


In Tables 5 and 6 we report the national averages for the two data bases as 
well as show the values for the provinces with largest deviation from the 
national average. For the Mortality File, we give the results only for cancer 
deaths in the nine provinces outside of Ontario so that the comparison with


the NCIR system is more meaningful. 


2.4 Multiple Registrations on the Cancer Incidence System 


The concept of the National Cancer Incidence Reporting System is to register 
all new incidents of malignant neoplasms. An individual should be registered 
more than once when multiple malignant neoplasms develop. To avoid duplicate 
registration of the same incident or duplicate reporting of patients registe- 
red in more than one province, all provincial cancer registries follow routine 
procedures. In spite of this, duplicate reporting of the same cancer incident 
may occur. In order to evaluate the extent of the duplication, we searched 
for records which are likely duplicates. The search was nowhere near
exhaustive, so that the number of potential duplicates found is an underestimate.
Of the 457,158 records, we removed the records with invalid surnames or years 
of diagnosis. For those with missing birthyear, we calculated the birthyear 
from the age when available. We also removed skin cancer records (ICDA-8 is 
173) since this is known to have multiple occurrences. 
Of the remaining records, we found those cases where all the following 
occurred: 

- birthyear or calculated birthyear matched exactly, 

- surname matched exactly, 

- first four letters of first given name agreed, 


- the three digit ICDA-8 code agreed. 


Of these records, we identified multiple registration as follows:

- year, month and day of birth were present and agreed, or

- day of birth was not present on at least one record but month of birth
agreed.


We also manually verified all groups with at least 3 individuals where the 
month and day information did not agree and all groups of 2 individuals where 


the month or day information was missing on at least one record. 
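
The two-stage search described in this subsection can be sketched as follows. All field names, the function names and the sample records are hypothetical. Pairs in which month of birth is missing are not flagged automatically here, since in the study such cases were verified manually.

```python
from collections import defaultdict

def blocking_key(rec):
    """First stage: birthyear, surname, first four letters of the first given
    name and the three-digit ICDA-8 code must all agree."""
    return (rec["birth_year"], rec["surname"], rec["given_name"][:4], rec["icda8"][:3])

def dates_agree(a, b):
    """Second stage: month and day of birth agree, or day is missing on at
    least one record but month agrees."""
    if a["birth_month"] is None or b["birth_month"] is None:
        return False                    # left for manual verification
    if a["birth_month"] != b["birth_month"]:
        return False
    if a["birth_day"] is None or b["birth_day"] is None:
        return True
    return a["birth_day"] == b["birth_day"]

def potential_duplicates(records):
    """Return pairs of record ids flagged as potential duplicates."""
    groups = defaultdict(list)
    for rec in records:
        groups[blocking_key(rec)].append(rec)
    pairs = []
    for group in groups.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if dates_agree(group[i], group[j]):
                    pairs.append((group[i]["id"], group[j]["id"]))
    return pairs

# Invented records for illustration only.
records = [
    {"id": 1, "surname": "SMITH", "given_name": "JOHN", "birth_year": 1910,
     "birth_month": 3, "birth_day": 14, "icda8": "162.1"},
    {"id": 2, "surname": "SMITH", "given_name": "JOHNNY", "birth_year": 1910,
     "birth_month": 3, "birth_day": None, "icda8": "162.0"},
    {"id": 3, "surname": "SMITH", "given_name": "JOHN", "birth_year": 1910,
     "birth_month": 5, "birth_day": 14, "icda8": "162.1"},
    {"id": 4, "surname": "JONES", "given_name": "MARY", "birth_year": 1922,
     "birth_month": 3, "birth_day": 14, "icda8": "174.0"},
]
pairs = potential_duplicates(records)
```

Grouping on the first-stage key before comparing dates avoids comparing every pair of records on a file of this size, but, as noted above, any search of this kind will miss duplicates whose surname or birthyear was recorded differently.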


In all, this resulted in identifying 6113 records which were potential
duplicates. These records correspond to 5947 individuals. (Note that some
individuals were duplicated more than once.) We did not make the judgment as to


whether these were legitimate multiple registrations or actual duplicates. 


For each 3-digit ICDA-8 value, we show in Table 7 the breakdown of these 
potential duplicates according to whether the records came from the same 
reporting registry or different registries, as well as how many potential 


duplicates have the same fourth digit of the ICDA-8 classification. 


3. DISCUSSION


3.1 Mortality-Incidence Ratios (Tables 1 and 2)


Ratios of cancer mortality to incidence can provide an indication of
completeness of registration. The ratios will vary with cancer site (the
highest ratios occur for sites with the lowest survival), age and sex for all
registries. However, if a comparison of the ratios for different registries
shows major differences within a given site, sex and age group, differences in
completeness of registration of new cases of cancer must be suspected. A higher
ratio, which means a higher proportion of deaths compared with new cases in the
same period, may indicate less complete registration of new cases.


In both time periods there were two registries which consistently had the
highest ratios for all sites combined and for most of the major sites shown.
There is little doubt that these high ratios do reflect underregistration of
new cases - the registries are the only ones which do not use death
notifications as one of their sources of registration. In addition, one of
these registries uses only a single data source, hospital reports, to register 
cancer cases. This registry had previously reported the results of a special 
study which showed that it was receiving notifications for only an estimated 
70% of new cancer patients admitted to hospitals up to the end of 1976. Fol- 
lowing this, major changes were made to improve the notification system. 
Since 1977 the registry has been reporting a higher number of cancer cases 


which is reflected in a marked reduction in mortality-incidence ratios. 


All other Canadian cancer registries use multiple sources of registration 
which is considered essential to achieve good coverage and which could also 
have a positive impact on the completeness and quality of individual data 
items. The completeness of reporting of data items was examined in this study 
(see Subsection 2.3). However, it turns out that the one registry that uses 
only one data source actually ranks quite highly in terms of completeness of 


information for many data items. 


A possible drawback associated with using multiple sources of registration is 
that duplicate registration may result. However an analysis of multiple 
registrations for the same individual and the same cancer site does not bear 
this out. In general, registries using a larger number of different sources 
of registration do not have more multiple registrations for the same site than 


registries using fewer sources of registration. 


Cancer mortality-incidence ratios for the other six registries were more simi- 
lar to each other. For these registries there was no consistent pattern of 
one registry always having higher or lower ratios for all sites and both time 


periods. 


There are many factors that can influence variations in the observed ratios by 
cancer site. Factors which tend to result in less complete registration of 
new cases and therefore higher mortality-incidence ratios include difficulty
in diagnosing the cancer (e.g. in deep-seated organs) and lack of access to 
specific data sources (e.g. haematology reports confirming a diagnosis of leu- 


kaemia). 


Factors which may lead to overregistration of new cases and lower ratios are
mass screening programmes (which may lead to the inclusion of prevalent cases,
especially for slow-growing tumours), duplicate registration, inclusion of
in-situ cases and inclusion of latent cancers discovered only at autopsy (this
particularly affects cancer of the prostate). In addition, differences in
the accuracy of assignment of diagnosis or cause of death may lead to artefac- 
tual differences. For example, death certificates may state "cancer of the 
uterus, unspecified" or "leukaemia, unspecified" as the cause of death whereas 
a cancer registry will often have more precise information and will assign 
more precise codes [6]. An analysis of mortality-incidence ratios at the 


level of the more detailed diagnosis would therefore show gross discrepancies. 


Of the leading cancer sites that were examined for males, cancer of the lung 
and stomach were associated with the highest mortality-incidence ratios for 
all registries but interprovincial variation in the ratios was greatest for 
cancer of the colon (excluding rectum), prostate and bladder. Use of cancer
incidence data in studies designed to identify differences in cancer risk by 
geographic area (province) would therefore be more reliable for the former two 


cancer sites. 


For females, of the leading sites examined, cancer of the colon and ovary had 
the highest ratios for all registries. Interprovincial variation in the 
ratios was greatest for cancer of the uterus and cervix uteri as well as for 
cancer of the colon. For purposes of interprovincial comparisons, incidence 
data for breast cancer and cancer of the ovary would therefore be more relia- 


ble. 


In the case of the sites of cancer of the uterus (other than cervix) and can- 
cer of the cervix, there are large interprovincial variations in the ratios if 
the sites are considered separately. This variation is greatly reduced if the 
two sites are combined, suggesting that there are differences in the accuracy 


of diagnosis and cause of death assignment for these sites. 


The site-specific mortality-incidence ratios were examined for major age
groups. The highest ratios consistently occur at older ages (65 and over) for
all registries and all sites shown. This is as expected, since the risk of
death increases with age so that proportionately more deaths than new cases
occur at older ages. It is also recognized that diagnosis and registration of 
cancers in older persons is generally more difficult. However, the relative 
increase in the ratios at older ages is much greater for the registries which 
have the highest average ratios to start with. This indicates that while all 
registries may have some difficulty in registering older patients, under- 
coverage of the older population is greater for registries which in general 


have less complete registration systems. 


The Canadian data therefore lend support to recommendations made by the 
International Union against Cancer (1970) and the International Agency for 
Research on Cancer (1976) and the reiteration of this recommendation in a 
recent paper by Doll and Peto [7] that "reasonably reliable comparisons of 
cancer incidence are obtained only if comparisons are limited to men and 


women in middle life". 


3.2 Matching Mortality and Incidence Records (Tables 3 and 4)


Since Statistics Canada is responsible for managing both the cancer incidence 
and the mortality data files, it is possible to compare reports for indivi- 
duals who are listed in the two separate data files to verify the reported 


information. 


In this part of the investigation of data quality, the accuracy of assignment 
of diagnosis and cause of death was of particular interest. Within the scope 
of the study it is only possible to describe the results - the reasons under- 
lying the discrepancies found remain unknown. However, it is felt that the 
findings are revealing and do indicate, in the case of the particular cancer 
selected for analysis, lung cancer, that agreement on diagnosis between the 


two files is generally high, over 90%. 


The study also indicates that a larger scale match would be feasible to assess
the accuracy of diagnosis and cause of death codes. Of course, if a larger
scale study were based on a sample, it would be preferable to stratify the
sample by year of diagnosis or year of death. In theory, if computerized
matching techniques were used, this type of analysis would be possible for all
cancer sites. If such an undertaking were to be supplemented with, for example,
studies on accuracy of coding of cause of death and diagnosis in the field,
such as described in two U.S. reports [8] [9], interpretation of
epidemiological research findings would be facilitated.


3.2.1 Mortality to Incidence File Search


Of the sample of 504 death records with an underlying cause of death of lung
cancer from 1969-1978, corresponding records on the cancer incidence file for
the years of diagnosis 1969-1978 were found for 415 (82%)⁵. The rate of
unsuccessful matches varied from 3.6% to 26.8% across the nine provinces. This
rate is influenced by four main factors: (a) the cancer was first diagnosed
prior to 1969, (b) new cases were underregistered, (c) identifying information
was not adequate to permit matching of records, or (d) the cause of death code
is incorrectly given as cancer. There were insufficient data to allow
assessment of the relative contribution of each of these factors.
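
The rates quoted in this section use different denominators: the match rate is taken over the full sample of death records, while agreement on diagnosis is taken over matched records only. A minimal sketch of that arithmetic, using the counts given in the text (the helper name `percent` and the rounded agreement count are illustrative assumptions, not figures from the study):

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

deaths_sampled = 504   # sampled lung cancer death records (from the text)
matched = 415          # corresponding incidence records found (from the text)

match_rate = percent(matched, deaths_sampled)
unmatched_rate = 100.0 - match_rate

# Agreement rates apply to matched records only. The count below is an
# approximation implied by the 92.5% figure, not a number reported in the text.
agree_primary_lung = round(0.925 * matched)

print(f"match rate: {match_rate:.1f}%")
print(f"unmatched rate: {unmatched_rate:.1f}%")
print(f"approx. matched records agreeing on primary lung cancer: {agree_primary_lung}")
```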


Of the 415 death records for which a corresponding record was found on the
incidence file, there was agreement on the diagnosis, primary cancer of the
lung, in 92.5% of cases. There was 95.2% agreement that a cancer of the
respiratory system was present. The small number of remaining records had
diagnoses for sites other than respiratory cancer on the incidence file. It is
generally accepted that the diagnostic information on cancer registry files is
more accurate than the cause of death information on death certificates.
However, given the scope of this study it is not possible to determine whether
misclassification on either of the files (or perhaps the fact that a lung
cancer was first diagnosed prior to 1969, followed by a subsequent registration
for another primary cancer) accounts for the disagreement. Across the
provinces, the rate of agreement on diagnosis varied from 74.4% to 100%.



⁵ In six cases more than one corresponding record for the same individual
existed on the incidence file. Only one of these records was counted as a
successful match.

Interestingly, for the province with total agreement on diagnosis there was
also the highest success rate (96.4%) of locating corresponding records on the
two files. This could possibly indicate close liaison between the provincial
vital statistics office and the cancer registry. In the reverse match of a
sample of cancer incidence records to mortality records (described in Section
3.2.2) it was the same province that had the highest rate of successful
matches as well as complete agreement on diagnosis.


3.2.2 Incidence to Mortality File Search 


The reverse search, using the incidence file for the years 1969-1978 as a
starting point and attempting to locate a corresponding record on the complete
mortality file for the same period, was successful for 69.4% of the selected
sample of 504 records with a diagnosis of primary lung cancer. The rate of
unsuccessful matches was higher than in the match from mortality to incidence
records for all provinces and varied from 19.6% to 44.6%. Possible reasons
for not finding a match include (a) that the patient was still alive at year
end 1978, or (b) that identifying information was not adequate to permit a
match. It is in general less likely that one will find a corresponding record
in the search from incidence to mortality file, since some persons diagnosed
with lung cancer do survive, whereas all persons who die from lung cancer
should be registered as new cases either prior to death or at time of death.


For the records that were successfully matched there was agreement that the
diagnosis was a primary lung cancer in 91.4% of cases, a rate very similar to
that found in the reverse comparison. The samples for the two comparisons
were chosen independently, so the consistency of the findings concerning
agreement on diagnosis is reassuring. Of the remaining cases, 4.3% had the
underlying cause of death classified to cancer sites other than the respiratory
system, and 3.7% had an underlying cause of death which was not cancer. For
this latter group it is possible that cancer was mentioned on the death
certificate as a contributing cause of death. This analysis is possible but was
not carried out.

3.2.3 Availability and Completeness of the Data (Tables 5 and 6)

One measure of quality and usefulness of the data files is the frequency of
valid information for specific data items. This measure is crude because
"valid" as defined here means valid according to computerized edit checks. It
does not rule out the possibility that imputation of missing information, or
errors in definition or classification of the data item, render the
information invalid.


Subject to the above caveats the measure may be useful in showing if and where
there are improvements in reporting of data items over the years, and whether
or not particular analyses of the data are feasible. For example, information
on date of birth is important for purely statistical (age-specific) analysis
of the data as well as for medical follow-up analyses which depend on good
identifying information.


On the cancer incidence file, a complete birthdate (i.e., day, month and year)
is on average present on only 68% of the records in the period 1969-1978. If
the two time periods, 1969-1973 and 1974-1978, are considered separately, some
improvement in the more recent period becomes evident.


On cancer mortality records for the same time period a complete birthdate is
present in over 95% of cases. However, at least part of this high rate is due
to the fact that the mortality system imputes a date of birth from age and
date of death when the birthdate is not reported. In 1976 the imputation rate
was 11.5% [5]. No such imputations are carried out in the cancer incidence
system⁶.


Small area analyses of cancer occurrence require complete and detailed
residence information. Cancer mortality data are much more useful for these
purposes because census division (county) of residence codes are present on
99.8% of records and census subdivision (city, town, village) codes are
present on 96.2% of records. In contrast, on the cancer incidence file,
census division codes are present on 89.6% of records and census subdivision
codes on only 25.2% of records. On the incidence file, there is improvement
in the reporting of census subdivision information in the second time period.

⁶ Imputations may be useful for statistical purposes but are actually
misleading in medical follow-up studies unless it is made clear that the
information is based on an imputation.

3.2.4 Multiple Registrations on the Cancer Incidence System (Table 7) 


Comparability of cancer incidence data is affected if there are differences in
the reporting of multiple primary cancers in the same individual and in
inadvertent duplicate registration.

The rules for reporting of multiple primary cancers are difficult to
interpret, so some provincial differences in their application are expected.


Inadvertent duplicate registration of the same cancer incident may occur if a
provincial registry cannot determine if the same case has been registered
previously (perhaps because identifying information is inadequate) or if the
same incident is reported by two different registries⁷. The search for
multiple primary cancers was restricted to multiple entries for the same
cancer site (at the 3-digit level of the ICDA code).


No attempt was made to separate duplicate registrations from multiple
primaries, although it can be speculated that the majority of cases reported by
two separate registries may be duplicates whereas the cases reported by the
same registry are more likely to be valid multiple primaries, particularly
those that differ in the 4th digit of the diagnosis code.


Using very strict matching criteria and excluding skin cancers (other than
melanoma of the skin), 1.7% (6113) of records on the 1969-1978 cancer
incidence file were identified as multiple entries.

By province, this rate varied from 0.5% to 1.9%. Only 0.4% of multiple
records were reported by two different registries. There was agreement down
to the 4th digit level of the diagnosis code in 88.4% of cases.
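
The search for multiple entries described above can be illustrated with a small sketch. It groups records by individual and 3-digit site code and then reports, for each group with more than one entry, whether the entries come from the same registry and whether they agree to the 4th digit. The record layout, identifiers and codes below are hypothetical; the actual matching criteria used in the study are not given in the text.

```python
from collections import defaultdict

# Hypothetical records: (person_id, registry, 4-digit diagnosis code as a string)
records = [
    ("p1", "A", "1530"),
    ("p1", "A", "1533"),  # same registry, same 3-digit site, different 4th digit
    ("p2", "A", "1621"),
    ("p2", "B", "1621"),  # two registries, identical code: possibly a duplicate
    ("p3", "A", "1510"),
]

def find_multiple_entries(records):
    """Group by (person, 3-digit site) and keep groups with more than one entry."""
    groups = defaultdict(list)
    for person, registry, code in records:
        groups[(person, code[:3])].append((registry, code))
    return {key: entries for key, entries in groups.items() if len(entries) > 1}

multiples = find_multiple_entries(records)

# For each multiple entry: (reported by one registry?, same 4-digit code?)
summary = {}
for key, entries in multiples.items():
    same_registry = len({registry for registry, _ in entries}) == 1
    same_4th_digit = len({code for _, code in entries}) == 1
    summary[key] = (same_registry, same_4th_digit)

for key, flags in sorted(summary.items()):
    print(key, flags)
```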


⁷ The National Cancer Incidence System does not carry out routine checks on
such duplication.

By cancer site, if only sites with more than fifty records are considered, the
rate of multiple primaries varied from 1.0% for cancer of the stomach and
pancreas to 3.6% for breast cancer. The high rate for breast cancer is not
surprising since the current rules for reporting of multiple primary cancers
require separate reports for cancers in both sides of (most) bilateral organs.


On the whole it is felt that while there is some inconsistency arising from
multiple primary and duplicate reporting, this is very small compared with
that arising from undercoverage.


4. SUMMARY 


The techniques described in this paper have been successful at identifying
differing levels of quality of cancer incidence and mortality data. It has
been found that the mortality-incidence ratios, in particular, can be used to
assess coverage errors, which are one of the major concerns of a high quality
cancer incidence system. The data quality for those who are registered on the
incidence system is sufficiently high that it is possible to assess the
quality of the cause of death classification on the mortality system through a
micro-data match. In fact a computerized micro-data match could be used to
evaluate the undercoverage because the number of cancer deaths without
previous registration on the NCIR system could be ascertained.
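
As a rough illustration of the mortality-incidence ratio used in Tables 1 and 2 (deaths in a period as a percentage of new cases registered in the same period), the sketch below computes the ratio for made-up counts. The province names and counts are hypothetical. Returning `None` when either count is zero mirrors the "-" entry described in the table notes.

```python
def mortality_incidence_ratio(deaths, new_cases):
    """Deaths in the period as a percentage of new cases registered.

    Returns None when either count is zero, as for the '-' table entries.
    """
    if deaths == 0 or new_cases == 0:
        return None
    return 100.0 * deaths / new_cases

# Hypothetical (deaths, new cases) counts for one site, age group and period
counts = {"Province X": (45, 90), "Province Y": (80, 85), "Province Z": (0, 12)}

ratios = {name: mortality_incidence_ratio(d, n) for name, (d, n) in counts.items()}

for name, ratio in ratios.items():
    print(name, "-" if ratio is None else f"{ratio:.1f}")
```

Under this reading, a province whose ratio is unusually high relative to the others would be a candidate for under-registration of new cases, although the ratio alone cannot separate coverage problems from genuine differences in survival.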
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Table 1 


Cancer Mortality - Incidence Ratios 


Deaths in Period (Mortality) as Percentage of New Cases Registered (Incidence) 
Canada (excluding Ontario) and Provinces with the Highest and Lowest Ratios. 


Leading Sites by Age Group and Sex

                      1969 - 1973                  1974 - 1978

              Canada   Highest   Lowest    Canada   Highest   Lowest
                       Ratio     Ratio              Ratio     Ratio


Prostate 
(185) 


Bladder 
(188) 


All Cancers 
(140-209) 
excluding 
skin (173) 


- Either mortality, incidence or both are zero. 
* Ratios based on fewer than 10 cases for both mortality and incidence. 


Table 2 


Cancer Mortality - Incidence Ratios 


Deaths in Period (Mortality) as Percentage of New Cases Registered (Incidence)
Canada (excluding Ontario) and Provinces with the Highest and Lowest Ratios. 


Females Leading Sites by Age Group and Sex 


                            1969 - 1973                  1974 - 1978

Cancer Site   Age   Canada   Highest   Lowest    Canada   Highest   Lowest
                             Ratio     Ratio              Ratio     Ratio


Uterus 
(182) 


Cervix Uteri 
(180) 


All Cancers
(140-209)
excluding
skin (173)


- Either mortality, incidence or both are zero. 
* Ratios based on fewer than 10 cases for both mortality and incidence. 
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Table 3 


Match of Mortality with Incidence Records 


Disease Classification on the Cancer Incidence File for a Sample of Lung Cancer Deaths 
(Percentage Distribution) 


Canada (excluding Ontario) and 
Provinces with the Highest and Lowest Rates of Lung Cancer Incidents among the Matches. 


CLASSIFICATION ON THE INCIDENCE FILE CANADA HIGHEST LOWEST 


CANCER OF THE RESPIRATORY SYSTEM 
162.1 (a) LUNG; Primary 
160-163 (b) OTHER RESPIRATORY SYSTEM; Primary 


197.0-197.3 (c) RESPIRATORY SYSTEM; Secondary 


OTHER CANCERS 
(a) BREAST 


(b) LYMPHATIC & HEMATOPOIETIC SYSTEM
200-209 Primary 


196 Secondary 
(c) OTHER SPECIFIED PRIMARY SITE


195, 199 (d) ILL DEFINED OR UNDEFINED SITE


SAMPLE SIZE FOR MATCHES 


MATCH SUCCESS RATE (%) 


Table 4 


Match of Incidence with Mortality Records 


Cause of Death Classification for a Sample of Lung Cancer Cases from the Cancer Incidence File 
(Percentage Distribution) 


Canada (excluding Ontario) and 
Provinces with the Highest and Lowest Rates of Lung Cancer Deaths among the Matches. 


ICDA-8 CLASSIFICATION ON THE MORTALITY FILE CANADA HIGHEST LOWEST 


1. CANCER OF THE RESPIRATORY SYSTEM 


162.1 (a) LUNG; Primary 
160-163 (b) OTHER RESPIRATORY SYSTEM; Primary 
197.0-197.3 (c) RESPIRATORY SYSTEM; Secondary


2. OTHER CANCERS 


174 (a) BREAST 
(b) LYMPHATIC & HEMATOPOIETIC SYSTEM
200-209 Primary 
196 Secondary 


(c) OTHER SPECIFIED PRIMARY SITE 


195, 199 (d) ILL DEFINED OR UNDEFINED SITE


3. NOT CANCER 


SAMPLE SIZE FOR MATCHES 


Table 5 


Cancer Incidence 


Availability and Completeness of Data Items 


Canada (excluding Ontario) and 


Provinces with the Highest and Lowest Percentages of Data Completeness 


DATA ITEM                      CANADA    HIGHEST    LOWEST
                                         PERCENT    PERCENT


DATE OF BIRTH
  Day
  Complete Birthdate

BIRTHPLACE
(Country or Province)

RESIDENCE
  Census Division
  Census Subdivision

DIAGNOSIS


Table 6 




Cancer Mortality 
Availability and Completeness of Data Items 


Canada (excluding Ontario) and 
Provinces with the Highest and Lowest Percentages of Data Completeness 


DATA ITEM CANADA HIGHEST LOWEST 
PERCENT PERCENT 


DATE OF BIRTH 


Day 


Complete Birthdate 


BIRTH PLACE 
(Country or Province) 


RESIDENCE 
Census Division 


Census Subdivision 
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Table 7 


Cancer Incidence 
1969 - 1978 
Multiple Primaries Within Each Site 
(Canada excluding Ontario) 


                                      Multiple Primaries

ICDA   Cancer Site          Number   Percent      Percent    Percent
                                     of           Same       With Same
                                     Incidence    Registry   4th Digit
                                                             ICDA Code
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Total All Sites (except skin, 173) 


140 Lip
141 Tongue
142 Salivary Gland
143 Gum
144 Floor of Mouth
145 Mouth, Other and Unspecified
146 Oropharynx
147 Nasopharynx
148 Hypopharynx
149 Pharynx, Unspecified

150 Oesophagus
151 Stomach
152 Small Intestine
153 Lge. Intestine Excl. Rectum
154 Rectum
155 Liver
156 Gall Bladder
157 Pancreas
158 Peritoneum
159 Unspec. Digestive Organs


160 Nose, Etc.
161 Larynx
162 Trachea, Bronchus, Lung
163 Resp. Organs, Other & NOS

170 Bone
171 Connective Tissue
172 Melanoma of Skin
174 Breast

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Table 7 (concl'd) 


Cancer Incidence 
1969 - 1978 
Multiple Primaries Within Each Site 
(Canada excluding Ontario) 


                                      Multiple Primaries

ICDA   Cancer Site          Number   Percent      Percent    Percent
                                     of           Same       With Same
                                     Incidence    Registry   4th Digit
                                                             ICDA Code


Cervix Uteri 
Chorionepithelioma 

Other, of Uterus 

Ovary, Etc. 

F. Genital Organs, Other 
Prostate 

Testis 

M. Genital Organs, Other 
Bladder 

Urinary Org., Other & NOS 
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Eye 

Brain 

Other Nervous System 
Thyroid Gland 

Other Endocrine Glands 

Ill - Defined Sites 

Sec. & Unspec. Lymph Nodes 
Sec., Resp. & Digestive 
Other Secondary 

Without Spec. of Site 




Lymphosarcoma, Etc. 
Hodgkin's Disease 

Other of Lymphoid Tissue 
Multiple Myeloma 
Lymphatic Leukaemia 
Myeloid Leukaemia 
Monocytic Leukaemia 

Other & Unspec. Leukaemia 
Polycythemia Vera 
Myelofibrosis 
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SURVEY METHODOLOGY 1983, Vol. 9. No. 1 


ESTIMATING MONTHLY GROSS FLOWS 
IN LABOUR FORCE PARTICIPATION¹


Stephen E. Fienberg and Elizabeth A. Stasny* 


The Canadian Labour Force Survey is a household survey conducted
each month for the purpose of producing point-in-time estimates of
the number of persons employed, unemployed and not in the labour
force. The survey has a rotating panel design in which all individuals
in a sampled household location are interviewed each month, for six
consecutive months. In the past, little use has been made of this
longitudinal structure, although considerable interest has been
expressed in the month-to-month gross flows (transitions) amongst the
labour force status categories. In this paper we discuss methods
being considered by Statistics Canada for the production of gross
flow estimates, but from a model-based perspective.


1. INTRODUCTION 


The Canadian Labour Force Survey is a monthly household survey used to produce 
cross-sectional or point-in-time estimates of labour force participation. 
This survey, however, like the Current Population Survey in the United States 
and many other large scale sample surveys, is designed using a panel structure 
so that the subjects are interviewed a number of times before being dropped 
from the sample. Although the survey is used mainly to obtain cross-sectional 
estimates, it has long been recognized that information from the repeated 
interviewing of subjects provides an additional longitudinal data base that 
could be exploited to give estimates of change over time for a very small 
additional cost (see, for example Kalachek, 1979, and Fienberg and Tanur, 


1765). 
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The authors wish to thank Murray Lawes, Larry Swain and Richard Veevers for 

Some attempts have been made to use the longitudinal data obtained from panel
surveys. For example, the longitudinal data from the Current Population
Survey have been used since 1948 to produce tables showing gross movements of
individuals between labor force categories from one month to the next.
Although these tables are produced each month, they have not been published
since 1952 because of statistical problems. Smith and Vanski (1979) discuss
the production of gross change data using the longitudinal structure of the
Current Population Survey.


Recently, Statistics Canada has initiated an investigation of possible uses of
the longitudinal data available as a by-product from the Canadian Labour Force
Survey. They, too, would like to find a method for producing reliable
estimates of gross movements between labour force categories. In this paper,
we discuss the methods being considered by Statistics Canada for the
production of such gross change data.


In Section 2, we give a brief description of the coverage and design of the
Labour Force Survey, and we describe the structure of the resulting data.
Then in Section 3 we outline the proposed method for gross flow estimation
developed by Statistics Canada, which depends on the use of sample-based
weights, adjustment for inflows to and outflows from the population of
interest, consistency adjustments, and bias correction for misclassification
error. By developing some simple models for the gross flow process, we
explore in Section 4 the implications of Statistics Canada's proposed method.
Finally, in Section 5 we describe some work on handling non-response in gross
flow estimation.


2. DESCRIPTION OF THE LABOUR FORCE SURVEY 


2.1 Survey Coverage 


Approximately 56,000 households, chosen from the ten provinces of Canada, are
included in the Labour Force Survey sample each month. Questionnaires are
completed for all civilian, non-institutionalized members of sampled
households who are 15 years of age and older. The survey questions primarily
relate to the subjects' work-related activities during the reference week,
which is the week prior to the survey week and usually contains the fifteenth
day of the month. Responses to the survey questions are used to classify
subjects as employed, unemployed, or not in the labour force. For a discussion
on classification of labour force status, see Guide to Labour Force Survey
Data, Statistics Canada (1979).


2.2 Survey Design 


The Labour Force Survey was designed to enable estimation of levels and rates
of employment and unemployment for each of the ten provinces separately.
Thus, except for the constraint on the total sample size, each province is
sampled independently.


Economic regions (ER's), areas of similar economic structure, form the primary
strata within provinces. ER's are divided into self-representing units
(SRU's) and non-self-representing units (NSRU's). SRU's are large urban
centers and NSRU's are generally composed of a small urban center and a rural
area. Sampling is carried out separately in SRU's and NSRU's.


SRU's are sampled using a stratified, two-stage sampling design. NSRU's are
sampled using a stratified multi-stage sampling scheme. In addition to the
SR and NSR areas, some sample units are selected from an apartment frame and a
special area frame. The final sampling units for the Labour Force Survey are
households. A detailed description of the sampling plan for the survey can be
found in Methodology of the Canadian Labour Force Survey 1976, Statistics
Canada (1977).


Households selected for the Labour Force Survey are included in the survey for
six consecutive months and are then dropped from the sample. For example,
households rotated into the survey in January are interviewed for six
consecutive months, and then dropped from the sample after the June interview.
Each group of households that rotates into and out of the sample together
makes up a panel. In any one month, individuals from six different panels are
included in the Labour Force Survey sample.
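
The rotation pattern can be sketched in a few lines. Assuming months are numbered consecutively and a panel that enters in month m is interviewed in months m through m+5 (the function name and numbering are illustrative, not Statistics Canada's), the sketch lists the panels interviewed in a given month and the share that is common to two consecutive months.

```python
def panels_in_sample(month, months_in_sample=6):
    """Return the entry months of the panels interviewed in `month`.

    A panel entering in month m is interviewed in months m, m+1, ..., m+5.
    """
    return set(range(month - months_in_sample + 1, month + 1))

t = 10
previous_panels = panels_in_sample(t - 1)
current_panels = panels_in_sample(t)

common = previous_panels & current_panels
overlap = len(common) / len(current_panels)

print("panels in month t-1:", sorted(previous_panels))
print("panels in month t:  ", sorted(current_panels))
print("share of panels common to both months:", overlap)
```

Five of the six panels are common to any two consecutive months, which is the source of the longitudinal data discussed in the rest of the paper.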

2.3 Sample-Based Weights


The cross-sectional data, information for a given month from all subjects in
the six panels interviewed in that month, are used to produce monthly estimates
of labour force participation. The monthly estimates are weighted averages of
values for each person in the sample. A weighted average is used because each
sampled person is thought of as "representing" a number of people in the
population of interest. The weight assigned to an individual's record
corresponds to the number of persons in the population that the person in the
sample represents.
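
A minimal sketch of this idea, with made-up records and weights: each sampled person contributes his or her weight to the estimated number of persons in the corresponding labour force category. The record layout is hypothetical, and the sketch ignores all of the adjustments that enter the real weights.

```python
from collections import Counter

# Hypothetical sample records for one month: (labour force status, weight)
sample = [("E", 300.0), ("E", 310.0), ("U", 295.0), ("N", 305.0), ("N", 290.0)]

def weighted_totals(sample):
    """Sum the weights within each labour force status."""
    totals = Counter()
    for status, weight in sample:
        totals[status] += weight
    return dict(totals)

totals = weighted_totals(sample)
population_estimate = sum(totals.values())

# Estimated share of the population of interest in each category
shares = {status: total / population_estimate for status, total in totals.items()}

print(totals)
print(shares)
```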


Let $W_{t,i}$ be the weight assigned to surveyed individual i in month t. If
individual i is classified as outside the population of interest in month t,
then $W_{t,i} = 0$. Otherwise, the assigned weights are determined by the
probability of selecting the cluster, the probability of selecting the
household within the cluster, non-response within the month, rural/urban
factors, sub-sampling adjustment for fast-growing areas, and ratio adjustments
for province/age/sex factors.


An individual's assigned weight can change from month to month because of
replacement of 1/6 of the sample each month, non-response, and, to a lesser
extent, because of changes in the size of the population of interest. Thus,
for any one individual, i, it may be the case that $W_{t-1,i} \neq W_{t,i}$.



2.4 Longitudinal Structure and Gross Flow Estimation 

Although the main purpose of the Labour Force Survey is to produce
point-in-time estimates of labour force participation, the panel structure of
the survey design results in a longitudinal data base, with approximately 5/6
of the household locations sampled in any one month being in the sample for the
following month. Naturally, fewer than 5/6 of the surveyed individuals or
families are the same in two consecutive months due to non-response and moving.
However, Statistics Canada is interested in the possibility of using
information from those individuals who do respond in two consecutive months to
produce estimates of gross flows among labour force categories.


Estimates of gross flows are useful for answering questions such as a) How
much of the increase in unemployment is due to persons losing jobs and how
much is due to persons formerly not in the labour force starting to look for
jobs? or b) How many unemployed persons become discouraged and leave the
labour force?


We discuss the problem of estimating gross flows among labour force categories
in the next two sections.


3. STATISTICS CANADA'S PROPOSED METHOD FOR GROSS FLOW ESTIMATION 


In this section we describe the multi-stage estimation procedure for gross
flows developed by Statistics Canada (e.g. see Macredie and Veevers, 1977;
Wong, 1983). Our description of the procedure includes various
interpretations of the impact of individual stages.

3.1 Data Needed to Estimate Gross Flows 


Statistics Canada has proposed estimating gross flows using a 4x4 matrix as
shown below:


                            Labour Force Status in Month t

                            E          U          N          O

Labour Force       E     $X_{EE}$   $X_{EU}$   $X_{EN}$   $X_{EO}$
Status in          U     $X_{UE}$   $X_{UU}$   $X_{UN}$   $X_{UO}$
Month t-1          N     $X_{NE}$   $X_{NU}$   $X_{NN}$   $X_{NO}$
                   O     $X_{OE}$   $X_{OU}$   $X_{ON}$   $X_{OO}$

where  E = employed,
       U = unemployed,
       N = not in the labour force,
       O = outside the population of interest, and
       $X_{ij}$ = estimated number with labour force status i in month
                  t-1 and status j in month t.


The final monthly Labour Force Survey files from two consecutive months can be
used to obtain the data for estimating the 4x4 matrix of gross flows. In
order to use these data to produce gross flow estimates, Statistics Canada must
match individual records from the two consecutive monthly files using the
unique identification numbers assigned to sampled individuals for the duration
of their time in the study.
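
The matching step can be pictured as a join of two monthly files on the identification number. The sketch below uses two hypothetical dictionaries that map an identifier to a labour force status; the real files carry many more fields and the identifiers shown are invented.

```python
# Hypothetical monthly files: identification number -> labour force status
month_prev = {"id1": "E", "id2": "U", "id3": "N", "id4": "E"}
month_curr = {"id1": "E", "id2": "E", "id3": "N", "id5": "U"}

def match_records(prev, curr):
    """Match two monthly files on identification number.

    Returns the matched pairs (status in t-1, status in t) and the sorted
    identifiers found in only one of the two files.
    """
    matched = {i: (prev[i], curr[i]) for i in prev.keys() & curr.keys()}
    only_prev = sorted(prev.keys() - curr.keys())
    only_curr = sorted(curr.keys() - prev.keys())
    return matched, only_prev, only_curr

matched, only_prev, only_curr = match_records(month_prev, month_curr)

print("matched:", matched)
print("in month t-1 only:", only_prev)
print("in month t only:", only_curr)
```

Records that appear in only one file correspond to rotation into or out of the sample, moves, or non-response in one of the two months.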


An individual appearing on the data file for one month may be missing from the
file for the other month due to rotation into or out of the sample or because
the person moved, was not at home, or refused to respond. The sample weights
described in Section 2.3 include an adjustment for non-response within each
month. When dealing with gross flows, we also need to consider the
month-to-month non-response. Statistics Canada proposes to reweight records for
individuals who responded in both months t-1 and t to compensate for this
additional non-response.

After the reweighting is completed, Statistics Canada will have a single data
file that includes information for all persons who were respondents in two
consecutive months. That file will contain geographic and demographic
information for each individual as well as the individual's labour force
status and assigned weights for both month t-1 and month t.

3.2 Differences in Weights 


As we noted in Section 2.3, an individual's sample-based weight can change
from month to month because of the rotation replacement structure,
non-response, and changes in the size of the population of interest. Even when
the adjustment factor for non-response is computed on the basis of
month-to-month data, for any individual i, it may still be the case that
$W_{t-1,i} \neq W_{t,i}$. Some method is needed for handling this difference
in weights if data are to be used for estimating gross flows.


Statistics Canada proposes to resolve this dilemma by assuming that
differences in the two weights occur only as a result of inflows to and
outflows from the population of interest. Thus, differences in weights are
added to the appropriate cell in either the last row or last column of the
gross flow matrix. This procedure depends heavily on the interpretation of the
weights suggested in Section 2.3, namely that sample individual i represents
$W_{t,i}$ persons in the population in month t.


As an illustration of the procedure, suppose an individual is classified as
employed in both months t-1 and t but $W_{t-1,i} = 300$ and $W_{t,i} = 305$.
The minimum weight, 300, is added to the EE cell of the gross flow table. The
difference, $W_{t,i} - W_{t-1,i}$, of 5 is added to the OE cell since those 5
would be thought of as having been outside the population of interest during
month t-1 and then having moved into the population as employed persons for
month t.

If, on the other hand, the individual is employed in both months but
$W_{t-1,i} = 305$ and $W_{t,i} = 300$, then 300 is again added to the EE cell
but the excess weight of 5 is added to the EO cell. Here, the difference
between weights represents 5 persons who were employed in month t-1 and then
moved outside the population of interest in month t.


An individual who is classified as outside the population of interest in month
t-1 and is then, say, employed in month t will have $W_{t-1,i} = 0$. If
$W_{t,i} = 300$, then 300 is added to the OE cell. Individuals classified as
outside the population of interest in month t are treated similarly with
$W_{t-1,i}$ being added to the appropriate cell in the last column of the
gross flow matrix.


Because persons outside the population of interest are assigned a weight of
zero, a person who is classified as such in both months t-1 and t would
have $W_{t-1,i} = 0$ and $W_{t,i} = 0$. Therefore, $X_{OO}$, the entry in the
OO cell of the gross flow matrix, must always be zero.
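
The allocation rule described in this section can be written as a short function. The following is a sketch of our reading of the rule, not Statistics Canada's production code: for each matched record, the smaller of the two weights goes to the cell for the observed transition, and any difference goes to the inflow row (O, j) if the weight increased or to the outflow column (i, O) if it decreased. The record layout and the function name are assumptions made for illustration.

```python
STATUSES = ["E", "U", "N", "O"]

def gross_flow_matrix(records):
    """Build the 4x4 matrix from (status_prev, w_prev, status_curr, w_curr) records.

    Persons outside the population of interest are assumed to carry weight 0.
    """
    x = {(i, j): 0.0 for i in STATUSES for j in STATUSES}
    for s_prev, w_prev, s_curr, w_curr in records:
        x[(s_prev, s_curr)] += min(w_prev, w_curr)
        if w_curr > w_prev:
            # Excess weight in month t is treated as an inflow to status s_curr
            x[("O", s_curr)] += w_curr - w_prev
        elif w_prev > w_curr:
            # Excess weight in month t-1 is treated as an outflow from s_prev
            x[(s_prev, "O")] += w_prev - w_curr
    return x

records = [
    ("E", 300.0, "E", 305.0),  # first example in the text: 300 to EE, 5 to OE
    ("E", 305.0, "E", 300.0),  # second example: 300 to EE, 5 to EO
    ("O", 0.0, "E", 300.0),    # outside in t-1, employed in t: 300 to OE
    ("U", 310.0, "O", 0.0),    # unemployed in t-1, outside in t: 310 to UO
]

x = gross_flow_matrix(records)
for i in STATUSES:
    print(i, [x[(i, j)] for j in STATUSES])
```

Because a person outside the population in both months has weight zero in both, nothing is ever added to the OO cell under this rule.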
3.3 Adjustment of the Inflow and Outflow Cells 


Adding differences in weights to the inflow and outflow cells of the gross
flow matrix provides a method for handling the changes in sample-based weights
from one month to the next and gives estimates of inflows to and outflows from
the population of interest. Independent estimates of inflows and outflows,
available from Census data, suggest that this method overestimates the actual
amount of movement into and out of the population of interest. Thus,
Statistics Canada plans to adjust the $X_{OE}$, $X_{OU}$, $X_{ON}$, $X_{EO}$,
$X_{UO}$, and $X_{NO}$ entries in the gross flow matrix. These cells will be
proportionally adjusted so that total inflows and outflows shown in the gross
flow matrix equal the Census estimates of inflows and outflows respectively.


Let I be the independent census estimate of inflows to the population of
interest and F be the census estimate of outflows from the population. Call the
sum of estimated inflows $X_{O+} = X_{OE} + X_{OU} + X_{ON}$ and the sum of
estimated outflows $X_{+O} = X_{EO} + X_{UO} + X_{NO}$. The proportionally
adjusted inflows are:

    $X^{P}_{Oj} = X_{Oj} \, I / X_{O+}$    for j = E, U, N.          (1)

The proportionally adjusted outflows are

    $X^{P}_{iO} = X_{iO} \, F / X_{+O}$    for i = E, U, N.          (2)
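
Equations (1) and (2) rescale the three inflow cells and the three outflow cells so that they sum to the independent census totals. A minimal sketch with made-up cell values and census totals, assuming both estimated sums are positive (the function name and the dictionary layout are illustrative):

```python
def adjust_inflows_outflows(x, inflow_total, outflow_total):
    """Proportionally adjust the inflow row and outflow column of the matrix.

    x maps (status in t-1, status in t) to an estimated count. The inflow
    cells (O, j) are scaled to sum to inflow_total and the outflow cells
    (i, O) to sum to outflow_total; all other cells are left unchanged.
    """
    adjusted = dict(x)
    x_o_plus = sum(x[("O", j)] for j in "EUN")  # estimated total inflows
    x_plus_o = sum(x[(i, "O")] for i in "EUN")  # estimated total outflows
    for j in "EUN":
        adjusted[("O", j)] = x[("O", j)] * inflow_total / x_o_plus
    for i in "EUN":
        adjusted[(i, "O")] = x[(i, "O")] * outflow_total / x_plus_o
    return adjusted

# Hypothetical cell values
x = {
    ("E", "E"): 900.0,
    ("O", "E"): 60.0, ("O", "U"): 30.0, ("O", "N"): 10.0,
    ("E", "O"): 40.0, ("U", "O"): 20.0, ("N", "O"): 20.0,
}

adjusted = adjust_inflows_outflows(x, inflow_total=50.0, outflow_total=40.0)
print(adjusted)
```

The relative sizes of the inflow cells (and of the outflow cells) are preserved; only their totals change.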


3.4 Consistency of Gross Flow Estimates With Monthly Totals 


Statistics Canada would like their gross flow estimates to be consistent with
the published monthly estimates of labour force participation totals. Thus,
the row totals for the gross flow matrix must be the month t-1 estimates of
labour force participation and the column totals must be the month t
cross-sectional estimates. The marginal totals of the gross flow matrix
constructed as described above are not consistent with the monthly labour
force totals.


Statistics Canada plans to use the method of iterative proportional scaling,
originally proposed by Deming and Stephan (1940), and described in detail by
Bishop, Fienberg, and Holland (1975), to adjust the gross flow matrix to agree
with the monthly labour force totals. When used to adjust the gross flow
matrix, iterative proportional scaling alternately 1) constrains the rows of
the matrix to sum to the month t-1 estimates and then 2) constrains the
columns to sum to the month t estimates. Steps 1) and 2) are repeated until the
entries in the matrix do not change from one step to the next.
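
The two alternating steps can be sketched as follows. This is a generic illustration of iterative proportional scaling on a small made-up table, not the implementation used at Statistics Canada. The tolerance, the iteration cap and the example values are assumptions, and the row and column targets are assumed to have the same grand total.

```python
def iterative_proportional_scaling(matrix, row_targets, col_targets,
                                   tol=1e-9, max_iter=1000):
    """Alternately scale rows and columns until the cells stop changing."""
    m = [list(row) for row in matrix]
    for _ in range(max_iter):
        max_change = 0.0
        # Step 1: constrain each row to sum to its target
        for i, target in enumerate(row_targets):
            total = sum(m[i])
            if total > 0:
                new_row = [v * target / total for v in m[i]]
                max_change = max(max_change,
                                 max(abs(a - b) for a, b in zip(new_row, m[i])))
                m[i] = new_row
        # Step 2: constrain each column to sum to its target
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in m)
            if total > 0:
                for row in m:
                    new_value = row[j] * target / total
                    max_change = max(max_change, abs(new_value - row[j]))
                    row[j] = new_value
        if max_change < tol:
            break
    return m

# Hypothetical 3x3 table with margins that share the same grand total (100)
start = [[40.0, 5.0, 5.0], [5.0, 10.0, 5.0], [5.0, 5.0, 20.0]]
row_targets = [55.0, 18.0, 27.0]
col_targets = [52.0, 20.0, 28.0]

fitted = iterative_proportional_scaling(start, row_targets, col_targets)
for row in fitted:
    print([round(v, 3) for v in row])
```

Because each step only multiplies cells by a factor, a cell that starts at zero, such as the OO cell of the gross flow matrix, stays at zero throughout.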


Testing at Statistics Canada has shown that cell changes resulting from the
application of iterative proportional scaling were both absolutely and
relatively small and fell roughly within the bounds of sampling variability
associated with the cells. This suggests that the consistency adjustment does
not seriously distort the gross flow estimates.
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3.5 Bias Correction for Misclassification Error 


Statistics Canada's proposed method for estimating gross flows also includes a
step correcting for misclassification bias. This is the bias that results
from the incorrect assignment of an individual's labour force status. A
technique developed by Fred Wong (1983) at Statistics Canada uses reinterview
data to correct for the misclassification bias.


4. IMPLICATIONS OF STATISTICS CANADA'S PROPOSED METHOD 
4.1 Modeling Gross Flows 


Each step of Statistics Canada's proposed method described in the previous
section is a logical attempt to correct problems that arise concerning the
production of good estimates of gross flows. It is not clear, however, what
effect the various adjustments have on the final estimated gross flow matrix.
In order to better understand Statistics Canada's proposal to treat
differences in weights as being due to inflows to and outflows from the
population of interest, in this section we develop a model for the gross flow
process. Our discussion centers on the quantities in the inflow and outflow
cells of the estimated gross flow matrix since the problems with Statistics
Canada's proposed method seem to occur in those cells. Because the design of
the Canadian Labour Force Survey is quite complex, we begin with a set of
simplifying assumptions. In the following we assume that


1. a single stage stratified sample is chosen, 
2. there is no response error, and 
3. non-response occurs only because random individuals move between
   strata or because of rotation into or out of the sample.
4.2 Allocation of Net Population Changes to Inflow and Outflow Cells 


Suppose that the population of interest is divided into S strata indexed by
s = 1, 2, ..., S. Let

    $N_k^s$ = population size in stratum s in month k.


Each month, a simple random sample is chosen from each stratum for the survey 
and sampled individuals are interviewed for six consecutive months before
being dropped from the survey. Our goal is to estimate gross flows from month
t-1 to month t. 


For the purpose of estimating gross flows, only individuals who are inter- 
viewed in both month t-1 and t will be used. This excludes individuals who 


rotate into or out of the sample and persons who move between strata. Let

    $r_{t-1,t}^s$ = number of sampled individuals from stratum s who were
                    interviewed in both months t-1 and t.

Each of the $r_{t-1,t}^s$ respondents in stratum s is assigned the following
weights in months t-1 and t respectively for the purpose of gross flow
estimation:

    $W_{t-1}^s = N_{t-1}^s / r_{t-1,t}^s$  and  $W_t^s = N_t^s / r_{t-1,t}^s$.    (3)


As long as movements between strata and selection of panels are "random
processes," these weights represent the inverse of the probability that an
individual within a stratum is interviewed in both months t-1 and t. Since all
individuals within a stratum have the same weight in any given month,
aggregates for each stratum may be used. Therefore, we let


    $n_{ij}^s$ = number of sampled individuals from stratum s classified as
                 having labour force status i in month t-1 and status j in
                 month t.

The methodology proposed by Statistics Canada requires that the minimum of the
months t-1 and t weights for each individual be added to the appropriate cell
in the gross flow matrix. The difference is added to the appropriate inflow
cell if the month t weight is greater than the weight in month t-1 and to the
appropriate outflow cell otherwise. Thus, for example, the stratum s entry in
the EE (employed to employed) cell of the gross flow matrix is:


    $\min(W_{t-1}^s, W_t^s)\, n_{EE}^s = \min[(N_{t-1}^s / r_{t-1,t}^s), (N_t^s / r_{t-1,t}^s)]\, n_{EE}^s$

        $= \min(N_{t-1}^s, N_t^s)\, n_{EE}^s / r_{t-1,t}^s$

        $= \min(N_{t-1}^s, N_t^s)\, f_{EE}^s$    (4)


where $f_{EE}^s$ = fraction of all individuals from stratum s,
                   interviewed in both months t-1 and t, who
                   were employed in both months.


The contribution from stratum s to the OE cell for the matrix from individuals 


employed in month t is: 


    $\max(0, W_t^s - W_{t-1}^s)\, n_{EE}^s = \max[0, (N_t^s - N_{t-1}^s) / r_{t-1,t}^s]\, n_{EE}^s$

        $= \max(0, N_t^s - N_{t-1}^s)\, n_{EE}^s / r_{t-1,t}^s$.    (5)


Differences from individuals falling in the UE and NE cells will also
contribute to the OE cell. Thus, the total contribution to the OE cell from
stratum s is:


    $\max(0, N_t^s - N_{t-1}^s)\, [(n_{EE}^s / r_{t-1,t}^s) + (n_{UE}^s / r_{t-1,t}^s) + (n_{NE}^s / r_{t-1,t}^s)]$

        $= \max(0, N_t^s - N_{t-1}^s)\, n_{\cdot E}^s / r_{t-1,t}^s$

        $= \max(0, N_t^s - N_{t-1}^s)\, f_{\cdot E}^s$    (6)

where $f_{\cdot E}^s$ = fraction of all individuals from stratum s,
                        interviewed in both months t-1 and t, who
                        were employed in month t.


We obtain totals for all cells in the gross flow matrix in a similar manner.


The resulting gross flow matrix is as follows: 


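Gross Flow Matrix - Month t-1 to Month t

(rows: month t-1 status; columns: month t status; i, j = E, U, N; O = outside
the population of interest)

    In-population cells (row i, column j):
        $\sum_{s=1}^{S} \min(N_{t-1}^s, N_t^s)\, f_{ij}^s$

    Inflow cells (row O, column j):
        $\sum_{s=1}^{S} \max(0, N_t^s - N_{t-1}^s)\, f_{\cdot j}^s$

    Outflow cells (row i, column O):
        $\sum_{s=1}^{S} \max(0, N_{t-1}^s - N_t^s)\, f_{i \cdot}^s$

where $f_{i \cdot}^s$ and $f_{\cdot j}^s$ are the fractions of individuals
from stratum s, interviewed in both months, who had status i in month t-1 and
status j in month t respectively.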


Notice that each term in the summations for the nine in-population cells (the
cells showing gross flows between employed, unemployed, and not in the labour
force) is the product of the smaller of the stratum's two monthly population
sizes and the observed fraction of subjects who had the various labour force
classifications in months t-1 and t. The out-of-population to employed cell
of the gross flow matrix contains a sum of terms from each stratum that grew
from month t-1 to month t. Each term is the product of the net increase in
size for the stratum and the fraction of the subjects in the stratum who
reported being employed in month t. The out-of-population to unemployed and
out-of-population to not in the labour force cells contain the sums of similar
terms except that the net increase in size for each stratum is multiplied by
the fraction of subjects in the stratum who were unemployed or not in the
labour force in month t respectively. In other words, the net increase in
size for each stratum is proportionally allocated to the three inflow cells of
the gross flow matrix based on the observed fractions of employed, unemployed,
and not in the labour force in month t. Similarly, the net decrease in size
for each stratum that shrank between months t-1 and t is proportionally
allocated to the outflow cells of the matrix based on the observed fractions
of employed, unemployed, and not in the labour force in month t-1.
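
The allocation rule described above can be illustrated for a single stratum with a short Python sketch. It assumes the simplified model of this section (equal weights within a stratum); the function name and all counts are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def stratum_contribution(n_prev, n_curr, counts):
    """Contribution of one stratum to the gross flow matrix.

    n_prev, n_curr: stratum population in months t-1 and t.
    counts: dict mapping (status in t-1, status in t) to the number of
            respondents interviewed in both months, statuses 'E', 'U', 'N'.
    Returns a dict of cell totals; 'O' marks the inflow row / outflow column.
    """
    r = sum(counts.values())              # respondents in both months
    w_prev, w_curr = n_prev / r, n_curr / r
    diff = w_curr - w_prev
    cells = {}
    for (i, j), n in counts.items():
        # The smaller weight goes to the in-population cell (i, j).
        cells[(i, j)] = cells.get((i, j), 0.0) + min(w_prev, w_curr) * n
        if diff > 0:
            # Stratum grew: the difference goes to the inflow cell (O, j).
            cells[("O", j)] = cells.get(("O", j), 0.0) + diff * n
        elif diff < 0:
            # Stratum shrank: the difference goes to the outflow cell (i, O).
            cells[(i, "O")] = cells.get((i, "O"), 0.0) - diff * n
    return cells


# Hypothetical stratum that grows from 1000 to 1100 with 100 respondents.
example_counts = {("E", "E"): 60, ("E", "U"): 5, ("U", "E"): 10,
                  ("U", "U"): 5, ("N", "N"): 20}
example_cells = stratum_contribution(1000, 1100, example_counts)
```

In this example the net increase of 100 is split across the three inflow cells in proportion to the month t fractions of employed, unemployed, and not in the labour force among the respondents.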


In this model we have assumed that the only way for counts to appear in the 
inflow and outflow cells is as a result of differences in weights. In practi- 
ce, a small number of individuals who move in and out of the population-of- 
interest show up in the sample and their assigned weights are added to the ap- 
propriate inflow and outflow cells. The effect of such individuals on the es- 


timates is very small. 


The fractions $f_{\cdot E}^s$, $f_{\cdot U}^s$, $f_{\cdot N}^s$, $f_{E \cdot}^s$,
$f_{U \cdot}^s$ and $f_{N \cdot}^s$ are estimated using individuals who appear
in the sample in both months. Almost all individuals classified, for example,
as OE could not be respondents in both months because they are not sampled by
design or because they are movers. Thus, these people who could not have been
respondents in both months are represented by people who did respond in both
months. To the extent that these groups differ, the proportional allocation
of net increases and decreases in strata size may result in biased estimates
in the inflow and outflow cells of the gross flow matrix.
4.3 Effects of Movements Between Strata 


The weights used for the purpose of gross flow estimation, as shown in
expression (3), are determined by the number of respondents in both months t-1
and t, a quantity that remains constant for the two months, and by the stratum
population. The population of a stratum changes if a) individuals enter from
outside the population of interest, such as when persons reach their 15th
birthday or leave the full-time military, b) individuals move outside the
population of interest, as when persons enter the military or an institution,
or c) individuals in the population of interest move between strata. This
subsection describes the effects of such changes in population size on the
quantities in the gross flow matrix.


As in the preceding subsection, we suppose that the population of interest is
divided into S strata. Again, individuals are sampled at random from each
stratum every month, interviewed for six consecutive months, and then dropped
from the sample. Let $r_{t-1,t}^s$ be, as before, the number of individuals
from stratum s who are interviewed in both months t-1 and t.

Next we suppose that there are $N_{t-1}^s$ individuals in stratum s in month
t-1. Let movements into and out of strata between months t-1 and t be denoted
by

    $m_{u,v}$ = number of individuals who move from u to v, $u \neq v$,
                between interviews for months t-1 and t, where u and v may
                take on the values

                    s = stratum s, for s = 1, 2, ..., S, and
                    0 = outside the population of interest.


Using this notation, the population in stratum s in month t is

    $N_t^s = N_{t-1}^s + \sum_{u \neq s} (m_{u,s} - m_{s,u})$.    (7)

The weights assigned to individuals in stratum s in months t-1 and t
respectively are

    $W_{t-1}^s = N_{t-1}^s / r_{t-1,t}^s$  and
    $W_t^s = [N_{t-1}^s + \sum_{u \neq s} (m_{u,s} - m_{s,u})] / r_{t-1,t}^s$.    (8)

Since our focus in this section is on movement into and out of the population
of interest, it is not necessary for us to divide those in the population of
interest into employed, unemployed, and not in the labour force. Thus, the
gross flow matrix used here is a 2x2 matrix formed by collapsing the first
three rows and columns of the 4x4 gross flow matrix used in the preceding
subsection.


The entry for stratum s in the in-population to in-population cell is 


    $\min(W_{t-1}^s, W_t^s)\, r_{t-1,t}^s$

        $= \min\{N_{t-1}^s / r_{t-1,t}^s,\ [N_{t-1}^s + \sum_{u \neq s} (m_{u,s} - m_{s,u})] / r_{t-1,t}^s\}\, r_{t-1,t}^s$

        $= \min[N_{t-1}^s,\ N_{t-1}^s + \sum_{u \neq s} (m_{u,s} - m_{s,u})]$

        $= N_{t-1}^s + \min[0, \sum_{u \neq s} (m_{u,s} - m_{s,u})]$.    (9)


The entry for stratum s in the out-of-population to in-population, or inflow, 


cell is 


    $\max(0, W_t^s - W_{t-1}^s)\, r_{t-1,t}^s = \max[0, \sum_{u \neq s} (m_{u,s} - m_{s,u}) / r_{t-1,t}^s]\, r_{t-1,t}^s$

        $= \max[0, \sum_{u \neq s} (m_{u,s} - m_{s,u})]$.    (10)


The entry for the in-population to out-of-population, or outflow, cell is 


found similarly. Thus, the 2x2 gross flow matrix is as follows: 


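Gross Flow Matrix - Month t-1 to Month t (collapsed)                    (11)

    In-population (month t-1) to in-population (month t):
        $\sum_{s=1}^{S} \{N_{t-1}^s + \min[0, \sum_{u \neq s} (m_{u,s} - m_{s,u})]\}$

    In-population (month t-1) to out-of-population (month t):
        $\sum_{s=1}^{S} \max[0, \sum_{u \neq s} (m_{s,u} - m_{u,s})]$

    Out-of-population (month t-1) to in-population (month t):
        $\sum_{s=1}^{S} \max[0, \sum_{u \neq s} (m_{u,s} - m_{s,u})]$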


Let us consider the quantity in the inflow cell of this gross flow matrix.
This cell should contain the net increase in population from outside the
population of interest, $m_{0,s} - m_{s,0}$, for each stratum that gained
members from outside the population. What the cell does contain is
$\sum_{u \neq s} (m_{u,s} - m_{s,u})$ for each stratum s that grew as a result
of movements between strata and from outside the population of interest. The
summation, $\sum_{u \neq s} (m_{u,s} - m_{s,u})$, does include the quantity
$m_{0,s} - m_{s,0}$ but it may also contain other terms.
For example, suppose the population is made up of three strata called A, B, 


and C. If strata A and B grew from month t-1 to month t and stratum C lost 


members, then the inflow cell contains 


    $\sum_{u \neq A} (m_{u,A} - m_{A,u}) + \sum_{u \neq B} (m_{u,B} - m_{B,u})$

        $= (m_{0,A} - m_{A,0}) + (m_{0,B} - m_{B,0}) + (m_{C,A} - m_{A,C}) + (m_{C,B} - m_{B,C})$.

Note that the movements between strata A and B cancel out but the terms
showing the movement between strata A and C and strata B and C remain in the
summation.
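
The three-strata example can be checked numerically. The Python sketch below computes the net change for each stratum, the resulting inflow and outflow cells of the 2x2 matrix, and the part of the inflow cell that comes from movements between strata. The function names and all mover counts are invented for illustration; 0 denotes outside the population of interest, as in the notation above.

```python
def net_change(movers, s, places):
    """Net movement into stratum s: sum over u != s of m[u, s] - m[s, u]."""
    return sum(movers.get((u, s), 0) - movers.get((s, u), 0)
               for u in places if u != s)


def collapsed_cells(movers, strata):
    """Inflow and outflow cells of the 2x2 matrix implied by the weights."""
    places = [0] + list(strata)   # 0 = outside the population of interest
    changes = {s: net_change(movers, s, places) for s in strata}
    inflow = sum(max(0, c) for c in changes.values())
    outflow = sum(max(0, -c) for c in changes.values())
    return changes, inflow, outflow


# Invented mover counts m[(u, v)]: strata A and B grow, stratum C shrinks.
m = {(0, "A"): 30, ("A", 0): 10, (0, "B"): 20, ("B", 0): 5,
     (0, "C"): 10, ("C", 0): 12,
     ("A", "B"): 15, ("B", "A"): 12,
     ("C", "A"): 40, ("A", "C"): 8,
     ("C", "B"): 25, ("B", "C"): 5}

changes, inflow_cell, outflow_cell = collapsed_cells(m, ["A", "B", "C"])
# Net inflow from outside the population into the strata that grew.
net_from_outside = sum(m[(0, s)] - m[(s, 0)]
                       for s, c in changes.items() if c > 0)
excess = inflow_cell - net_from_outside
```

With these numbers the A-B movements cancel, and the excess in the inflow cell equals the net movement from C into A plus the net movement from C into B.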


In general, the inflow cell contains extra terms of the form
$m_{v,u} - m_{u,v}$ for each stratum v that loses population while stratum u
gains population. Similarly, the outflow cell contains extra terms of the
form $m_{x,y} - m_{y,x}$ for each stratum y that grows while stratum x loses
population.


In the inflow cell, the quantity $\sum_{u \neq s} (m_{u,s} - m_{s,u})$ for
each stratum s that gains population from month t-1 to t will be positive,
although each individual term in the summation need not be positive. If

    $\sum_{u \neq s} (m_{u,s} - m_{s,u}) > m_{0,s} - m_{s,0}$    (12)


then the contribution from stratum s is more than the net inflow to stratum s
from outside the population of interest. This excess comes from terms of the
form $m_{v,u} - m_{u,v}$ as described above. That is, the overestimate is due
to movements between strata within the population. A similar result holds for
the in-population to out-of-population cell of the matrix.


Statistics Canada staff report that the method they proposed for handling dif- 
ferences in weights from month to month does appear to give overestimates in 
the inflow and outflow cells of the gross flow matrix. Although they are 
based on simplifying assumptions, the results here give a possible explanation 
for the overestimation, i.e. the overestimation may be due to movements within 


the population of interest. 


Finally we note that, in the 2x2 gross flow matrix shown above, the in- 
population to in-population cell must contain an underestimate equal to the 
overestimate in the outflow cell. Whatever the amount of underestimation, it 
is spread over the nine in-population to in-population cells in the 4x4 gross 


flow matrix. Moreover, the size of the overestimation is small in comparison
to the total size of the nine in-population cells.

4.4 Comments on the Proposed Gross Flow Estimation Method 


The results described in the preceding two subsections illustrate problems
with the proposed method of handling month-to-month differences in weights for
the purpose of gross flow estimation. These results do not come as a surprise
to Statistics Canada. Because of their experience with Labour Force Survey
methods and data, they realized that the movements of individuals within the
population might explain some of the overestimation in the inflow and outflow
cells of the gross flow matrix. The results obtained by modelling the process
reinforce their beliefs and make it clear just how the movements of
individuals affect the estimates. In addition, the modelling brought to light
a problem of which Statistics Canada had not been aware: the compensating
underestimation spread over the nine in-population to in-population cells of
the gross flow matrix.


In section 4.2, we saw that the net increases in strata are allocated to the
inflow cells while the net decreases are allocated to the outflow cells
according to the fractions of observed individuals classified as employed,
unemployed, and not in the labour force in month t and month t-1,
respectively. The estimation of inflows and outflows in this manner is valid
only if individuals who move in and out of the population of interest are a 
random sample of individuals and, hence, "the same" as individuals who remain 
within the population of interest. Sampled individuals who are classified as 
outside the population of interest appear in the sample by accident rather 
than by design; the Labour Force Survey is not designed to estimate numbers of 
persons outside the population of interest. If we want to obtain reasonable 
estimates for the inflow and outflow cells of the matrix, it may be necessary 
to include individuals outside the population of interest in the Labour Force 


Survey sample or to use a special, supplementary sample. 


In section 4.3, we saw that the overestimates in the inflow and outflow cells
could be a result of movements of individuals between a stratum whose
population grew and a stratum whose population shrank. The fact that it was
movements between strata that caused the problem is a result of the
simplifying assumptions which we made. We assumed that the final sample was
randomly chosen from within each stratum. Hence, the weights assigned to
individuals sampled from a single stratum were equal. If, instead, we assumed
that the strata had been divided into clusters and random samples of
individuals had been chosen from within the clusters, then all individuals
sampled from a single cluster would have been assigned the same weight and the
overestimate would come as a result of movements between clusters.


To correct for the overestimate, and corresponding underestimate, directly in
the case where final samples are chosen at random from within strata, we would
need estimates of the number of movers between each pair of strata where one
stratum grew and the other stratum lost population. If the final samples are
chosen at random from within clusters, similar estimates would be required for
each pair of clusters. This is a considerable amount of information. A
further complication is that, in practice, the ratio adjustments applied to
the weights make it possible for individuals within one household to have
unequal weights.


As was suggested earlier, if individuals outside the population of interest 
were included in the sample, we could obtain estimates of movement into and 
out of the population of interest directly. One other possibility that should 
be considered is to discard the monthly weights for the purpose of gross flow 
estimation and derive a longitudinal weight for each individual in the Labour 


Force Survey sample in either of the two months. 


As statisticians, we are quite comfortable with estimates of gross flows that
do not have the published monthly labour force participation totals as
marginal totals; however, we realize the problems that might arise if gross
flow estimates that were not consistent with the monthly totals were
published. Nevertheless, it should not be assumed automatically that the
monthly estimates are correct and that the problem lies solely in the gross
flow estimates.

As we noted in section 3.5, the gross flow matrix is adjusted to correct for
misclassification errors. The monthly estimates, however, are not corrected
for misclassification bias. Thus, when iterative proportional fitting is used
to adjust the gross flow matrix to agree with the monthly totals, the matrix
is being altered to be consistent with biased values. We feel that it would
be more appropriate to address the problem of misclassification of labour
force status in the monthly data where it occurs rather than just in the
estimates of gross flows.


5. NON-RESPONSE AND GROSS FLOW ESTIMATION


Statistics Canada's proposed method for gross flow estimation compensates for 
non-response by adjusting the sample-based weights of respondents. This 
method of handling non-response is appropriate if the missing data are missing 
at random (e.g., see Rubin, 1976). In order to explore the assumption of 
random non-response, we used a longitudinal file for a single panel to produce
the data in Table 1. This table shows the unweighted percentages of
individuals reporting being employed or unemployed in zero to six months


according to the number of months in which they responded to the survey. 


Consider the probabilities underlying the observed percentages shown in part 
(a) of Table 1. Let 


    $\pi_i$ = probability that an individual is employed in i out
              of 6 months, for i = 0, 1, ..., 6.


Under the assumption that non-response occurs at random, the probabilities
corresponding to the first column of that table can be written as

    P(observing 0 months employed out of 6-k months responding)

        $= \sum_{i=0}^{k} \pi_i \binom{k}{i} / \binom{6}{i}$,    k = 0, 1, ..., 5.    (13)


Notice that these probabilities increase from the first row of the column to 


the last row. 
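
As a rough numerical check of this monotonicity, the following Python sketch evaluates the first-column probabilities under one reading of random non-response: the 6-k responding months are assumed to be a uniformly random subset of the six months, so that an individual employed in i months shows no employed months with probability C(k, i) / C(6, i). The function name and the values of the probabilities are invented for illustration and are not taken from Table 1.

```python
from math import comb


def prob_zero_employed_observed(pi, k, months=6):
    """P(no employed month is observed | months - k months of response).

    pi[i] is the probability of being employed in i of the `months` months.
    The responding months are assumed to be a uniformly random subset, so all
    i employed months must fall among the k non-responding months.
    """
    return sum(pi[i] * comb(k, i) / comb(months, i) for i in range(k + 1))


# Invented distribution of months employed (sums to 1).
pi_example = [0.30, 0.05, 0.05, 0.05, 0.05, 0.10, 0.40]
first_column = [prob_zero_employed_observed(pi_example, k) for k in range(6)]
```

Because each coefficient C(k, i) / C(6, i) is non-decreasing in k and a new non-negative term is added at each step, the values in `first_column` cannot decrease as the number of responding months falls.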


In a similar manner, it can be shown that, if data are missing at random, then 
the underlying probabilities must increase from the top to bottom of each co- 
lumn in both tables. The first column of each table deviates from this pat- 
tern quite noticeably. In both cases, the observed percentages decrease 
through the first four rows of the table and then increase in the last two 
rows. It does not seem likely that sampling variability alone could be res- 
ponsible for such a pattern in both tables. Thus, it appears as if there is 


some evidence that non-response does not occur at random. 


Of course, the above analysis is based on just a single panel of Labour Force 
Survey data. However, in a larger study using data from 1980 and 1981, Paul 
and Lawes (1982) also found evidence of a relationship between employment sta- 
tus and non-response. Therefore, there is a need to consider methods for 
gross flow estimation that do not require the assumption that non-response oc- 


curs at random. 


Statistics Canada's proposed method for gross flow estimation only makes use 
of the information from individuals who responded in both of the months. 
There is also information available from those individuals who responded in 
just one of the two months. Stasny (1983) presents a method for month-to- 
month gross flow estimation that makes use of the information available from 
individuals who are respondents in only one of the two months and that can be 
used when non-response is related to time or employment status. For this 
method, we take the observed gross flow data to be the end result of a two- 
stage process. In the first stage of the process, which we do not get to ob- 
serve, individuals are allocated to the sixteen cells of the gross flow matrix 
according to a single multinomial sampling scheme. Then, in the second stage, 
each individual may lose either the month t-1 or month t labour force 
classification with some probability. The probability of losing a month's 
classification can be modeled to depend on the month, or labour force status, 


or both. Maximum likelihood estimates for the parameters of the multinomial
distribution of the first stage and the probabilities of losing a month's
classification are obtained using iterative methods.
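
The two-stage process can be made concrete with a small simulation. The Python sketch below only generates data of the kind the model describes; it does not reproduce the maximum likelihood fitting in Stasny (1983). The function name, the equal cell probabilities, the loss probabilities, and the rule that at most one month's classification is lost per individual are all assumptions made for the example.

```python
import random

STATUSES = ["E", "U", "N", "O"]


def simulate_two_stage(n, cell_probs, loss_prob, rng):
    """Simulate observed gross flow data under a two-stage process.

    cell_probs maps (status in t-1, status in t) to a multinomial probability.
    loss_prob(month, status) gives the chance that this month's classification
    is lost; at most one month is lost per individual (an assumption here).
    """
    cells = list(cell_probs)
    weights = [cell_probs[c] for c in cells]
    full = {}        # both months observed: (i, j) -> count
    only_prev = {}   # month t lost: status i in t-1 -> count
    only_curr = {}   # month t-1 lost: status j in t -> count
    # Stage 1: allocate individuals to cells by a single multinomial draw.
    for i, j in rng.choices(cells, weights=weights, k=n):
        # Stage 2: possibly lose one month's classification.
        p_lose_prev = loss_prob("t-1", i)
        p_lose_curr = loss_prob("t", j)
        u = rng.random()
        if u < p_lose_prev:
            only_curr[j] = only_curr.get(j, 0) + 1
        elif u < p_lose_prev + p_lose_curr:
            only_prev[i] = only_prev.get(i, 0) + 1
        else:
            full[(i, j)] = full.get((i, j), 0) + 1
    return full, only_prev, only_curr


# Invented inputs: equal cell probabilities, loss depending on status only.
uniform_cells = {(i, j): 1 / 16 for i in STATUSES for j in STATUSES}
status_loss = {"E": 0.05, "U": 0.15, "N": 0.10, "O": 0.20}
full, only_prev, only_curr = simulate_two_stage(
    1000, uniform_cells, lambda month, status: status_loss[status],
    random.Random(0))
```

Passing a `loss_prob` that depends on the month argument instead of the status gives the alternative model in which non-response is related to time.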


When these models were fit to Labour Force Survey data from a single panel, 
Stasny (1983) found that the model where the probability of losing a month's 
classification depends on labour force status provides a reasonable fit to the 
data for all gross flow matrices with the exception of the months 1-2 matrix. 
For the data from month 1 to month 2, the probability of losing a month's
classification appears to depend on the month. This may be due to the fact
that there is higher non-response in the first month a panel is in the 
survey. We believe that it would be worthwhile to fit this type of model to 
additional data from the Labour Force Survey to see if similar results are ob- 


tained over other panels. 


Clearly, the problem of obtaining good estimates of gross flows from the
Labour Force Survey is not a simple one. The survey is designed to give data
for the production of monthly estimates of labour force participation, not
estimates of gross flows. A survey designed specifically for the purpose of
estimating gross flows among labour force categories would certainly be
different from the current Labour Force Survey. Thus, the longitudinal data
from the survey are not ideal for gross flow estimation. The data, however,
are available and, if they can be used to give reasonable estimates of gross
flows, then additional, useful information is produced for a relatively small
cost.


REDESIGN OF THE NIAGARA TENDER FRUIT
OBJECTIVE YIELD SURVEY


ae Kovar } 


The peach, sour cherry and the grape objective yield surveys have 
been carried out annually in the Niagara Peninsula since 1964 in 
order to forecast the magnitude of change in marketable fruit pro- 
duction from the previous year. Timeliness of the estimates is 
essential in order to enable the Ontario Tender Fruit Growers 
Marketing Board (OTFGMB) and the Ontario Grape Growers Marketing 
Board (OGGMB) to establish the marketing strategies well ahead of 
the harvest. This paper summarizes the major changes due to the 
second redesign initiated in 1982. In particular, the sample design,
data collection operation and modifications of the estimation
procedures are elaborated upon.


1. INTRODUCTION 


The decision to switch from a list frame to an area frame survey was made in 
the first redesign in 1974 primarily due to the lack of an adequate list of 
commercial growers in the Niagara Peninsula. However, in 1981 the Ontario
Ministry of Agriculture and Food (OMAF) conducted a Tree Fruit Census and
a Grape Vine Census. The availability of the census data makes it possible to 
redesign the survey for the second time in order to reflect the changes that 
the industry has undergone in the last eight years. Based on discussions with 
OMAF, it was decided that the census lists of growers are complete and accu- 
rate and that they contain sufficient information to form the sampling frame 
for the Tender Fruit Surveys. As a result, the peach, sour cherry and grape
surveys will be conducted employing three independent samples selected from


the 1981 OMAF census lists. 


The object of all three surveys is to forecast the total amount of fruit
actually sold (as fresh fruit or to processors). These forecasts are made by


aos Kovar, Business Survey Methods Division, Statistics Canada. This work 
was done while the author was in the Institutional and Agriculture Survey 
Methods Division, Statistics Canada. 
estimating a ratio of the number of pieces of marketable fruit in the current
year to the corresponding total for the previous year and applying this ratio
to the previous year's figure of actual amount sold reported as a tonnage by
the Ontario Fruit and Vegetable Statistics Committee. Thus an assumption of
high correlation of fruit weight and fruit count must be made. Secondly, due
to the time lag between the surveys and the harvest, it has to be assumed that
any loss of fruit between these two times is consistent from year to year.
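
The forecasting rule amounts to one line of arithmetic, sketched here in Python. The function name and the figures are hypothetical; the sketch simply scales the previous year's reported tonnage by the estimated ratio of marketable fruit counts, under the two assumptions stated above.

```python
def forecast_tonnage(count_current, count_previous, tonnage_previous):
    """Scale last year's reported tonnage by the estimated fruit-count ratio."""
    if count_previous <= 0:
        raise ValueError("previous-year count must be positive")
    ratio = count_current / count_previous
    return ratio * tonnage_previous


# Invented figures: estimated marketable fruit counts and last year's tonnage.
forecast = forecast_tonnage(1_200_000, 1_000_000, 5000.0)
```

Here an estimated 20 per cent increase in the fruit count translates directly into a 20 per cent increase in the forecast tonnage.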


2. OVERVIEW OF THE SAMPLE DESIGN 


The samples of the three objective yield surveys were selected independently 
according to a multistage, stratified (by geographical region), replicated, 
pps (farms and orchards/vineyards were selected with probability proportional 
to size), nearly self-weighting (all trees/vines have an approximately equal 
probability of selection) sample design. Figure 1 provides a visual summary 
of the sampling strategy. Note that due to the fact that the weight variables 
are collected at various points in time, the design is not exactly self- 


weighting. 


2.1 Target Population, Sampling Frames and Total Sample Size 


The target population for the three objective yield surveys comprises all com- 
mercial growers of the respective fruit in the Niagara Peninsula. Commercial 
growers for the three surveys were defined by OMAF as operators of those 
holdings which reported more than 200 peach trees, 200 sour cherry trees or 
5000 grape vines respectively in the 1981 Tree Fruit or Grape Vine Censuses. 
Using the above definition, a separate frame was created for each of the three 
surveys. The lists for the peach, the sour cherry and the grape surveys con- 
tain 423, 145 and 552 commercial growers respectively. The total sample size 
(number of orchards/vineyards to be enumerated) for each survey was determined 
by OMAF's budget constraints to be in the neighbourhood of 60 for the peach 
survey, 55 for the sour cherry survey and 155 for the grape survey. Since for 
the grape survey all available varieties of interest are to be sampled on a 
selected farm, the final sample size for the grape survey is unknown.
However, based on the 1981 Grape Vine Census, it is estimated that 62 farms
will generate a sample of approximately 155 vineyards.

FIGURE 1: Sample Design for the Tender Fruit Objective Yield Surveys

SAMPLING FRAME:     ALL COMMERCIAL GROWERS IN THE FOUR REGIONS ENUMERATED IN
                    THE 1981 OMAF CENSUSES

STRATIFICATION      STRATIFY BY REGION WITH COMPROMISE OF OPTIMAL ALLOCATION
& ALLOCATION:       OF SAMPLE SIZE TO THE 4 STRATA

1st STAGE:          IN EACH STRATUM, SELECT A REPLICATED, pps SYSTEMATIC
                    SAMPLE OF FARMS BASED ON THE NUMBER OF TREES/VINES IN THE
                    1981 CENSUS

2nd STAGE:          FROM EACH SAMPLE FARM, SELECT AN ORCHARD/VINEYARD, pps
                    BASED ON THE NUMBER OF TREES/VINES. (FOR THE GRAPE SURVEY
                    SELECT ONE SUCH VINEYARD FOR EACH OF THE THREE VARIETIES
                    GROWN ON THAT HOLDING)

3rd STAGE:          SELECT A SIMPLE RANDOM SAMPLE OF 4 TREES/5 VINES ON EACH
                    OF THE SELECTED ORCHARDS/VINEYARDS

4th STAGE:          SOUR CHERRY SURVEY: SELECT A SAMPLE LIMB(S) WITH
                    PROBABILITY PROPORTIONAL TO CROSS-SECTIONAL AREA ON EACH
                    SELECTED TREE
                    PEACH SURVEY: (NONE)
                    GRAPE SURVEY: SELECT A SIMPLE RANDOM SAMPLE OF 5 BUNCHES
                    OF GRAPES ON EACH SELECTED VINE

DATA COLLECTION:    COUNT ALL MARKETABLE FRUIT ON ALL SELECTED SOUR CHERRY
                    LIMBS/PEACH TREES/GRAPE BUNCHES

2.2 Stratification and Sample Size Allocation to Regions


The Niagara Peninsula was divided into four regions for which separate estima- 
tes are required. These were defined as follows (based on the 1976 Census 


boundaries): 


Region 1: Town of Grimsby in the Niagara Regional Municipality and township of 
Saltfleet in the Regional Municipality of Hamilton-Wentworth. (Town- 
ship 8 of county 29 and township 4 of county 17). 


Region 2: City of St. Catharines and the Town of Lincoln in the Niagara
          Regional Municipality. (Townships 5 and 9 of county 29).


Region 3: Town of Pelham and the Town of Thorold in the Niagara Regional
          Municipality. (Townships 11 and 12 of county 29).


Region 4: City of Niagara Falls and the Town of Niagara-on-the-Lake in the 
Niagara Regional Municipality. (Townships 3 and 10 of county 29). 


Due to the increasing demand for crop production estimates by geographic area,
an independent sample of farms was drawn in each of the four regions. An 
attempt was made to allocate the resources (i.e. number of farms sampled) 
optimally between regions. However, due to the unusually small population 
size in some regions (see Table 1) a compromise between proportional alloca- 
tion, optimal allocation and a rule of "minimum of 2 farms per region per 
replicate" was made. The latter rule was deemed appropriate in order to dimi- 
nish the possibility of complete nonresponse in a given replicate (as could be 
the case if only one farm per replicate was selected). The number of trees/ 
vines in each farm was used as a measure-of-size variable for the purposes of 
allocation as well as for pps selection in the first and second stages. Pre- 
vious results [6] indicate that other proxy variables (such as area under 


cultivation) are likely to be no more efficient than the tree-count variable. 

TABLE 1: Population and Sample Sizes for the Tender Fruit Surveys of
         Commercial Growers by Region

               REGION 1       REGION 2       REGION 3       REGION 4        TOTAL
             POPL'N SAMPLE  POPL'N SAMPLE  POPL'N SAMPLE  POPL'N SAMPLE  POPL'N SAMPLE
PEACH           15     4      198    22       15     4      195    30      423    60
SOUR CHERRY     20     4       55    20       30    20       40    12      145    56
GRAPE           67     4      275    32       46     4      164    22      552    62

2.3 First Stage Design 


Within each region, for each survey, two independent replicates of farms were
selected systematically (in order to obtain a representative sample) with
probability proportional to the total number of trees/vines on the holding as
of the 1981 Censuses. The total sample sizes for the two replicates are
displayed in Table 1 by region. Since the two replicates are selected
independently and since large farms are more likely to be selected in the
sample, it is to be expected that a certain amount of overlap between
replicates will exist. In fact, some farms are so large that not only are
they guaranteed to be in the sample, but they can appear more than once in the
same replicate (see [4]). Each such appearance is treated as a separate event
and one orchard/vineyard is selected without replacement every time the farm
is selected. The actual number of distinct farms in the sample is therefore
decreased as indicated in Table 2.
TABLE 2: Total Number of Distinct Farms in the Sample by Region

|             | REGION 1 | REGION 2 | REGION 3 | REGION 4 | TOTAL |
| PEACH       |     3    |    22    |     3    |    27    |   55  |
| SOUR CHERRY |     4    |    17    |    15    |    10    |   46  |
| GRAPE       |     4    |    30    |     4    |    20    |   58  |

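
The first-stage selection described above can be illustrated with a minimal sketch of systematic sampling with probability proportional to size. The function name `pps_systematic` and the example tree counts are hypothetical and are not taken from the survey; the sketch only shows why a very large farm can be hit more than once within a single replicate.

```python
import random


def pps_systematic(sizes, n, rng=None):
    """Select n units systematically with probability proportional to size.

    Returns a list of n indices into `sizes`. A unit whose size exceeds the
    sampling interval is certain to be selected and may appear more than once.
    """
    rng = rng or random.Random()
    total = sum(sizes)
    interval = total / n
    start = rng.random() * interval          # random start in [0, interval)
    points = [start + k * interval for k in range(n)]

    hits = []
    cumulative = 0.0
    i = 0
    for index, size in enumerate(sizes):
        cumulative += size
        # Every selection point falling inside this unit's range is a hit.
        while i < n and points[i] < cumulative:
            hits.append(index)
            i += 1
    # Guard against floating-point rounding at the very end of the list.
    while len(hits) < n:
        hits.append(len(sizes) - 1)
    return hits


# Hypothetical tree counts for five farms in one region.
tree_counts = [10, 20, 500, 30, 40]
replicate_1 = pps_systematic(tree_counts, 4, random.Random(1))
replicate_2 = pps_systematic(tree_counts, 4, random.Random(2))
```

With these illustrative sizes the sampling interval is 150 trees, so the farm with 500 trees is selected at least twice in every replicate, which is the situation Table 2 accounts for.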
2.4 Second Stage Design


From the second stage on, the sampling strategies involve some field
operation. Once an initial contact with the farmer is made (in the spring of
1983) it is imperative that every effort be made to obtain the respondent's
co-operation. It is at this time that the farmer will be requested to aid the
enumerator in listing all orchards and establishing the current size (i.e.
number of trees) of each for the peach and sour cherry surveys. For the grape
survey a similar listing must be prepared for each of the three varieties of
interest: Concord, DeChaunac and "Other". (Note that since some varieties
are grown together, one vineyard can appear on several of the lists. However,
its size for a particular variety listing would be measured by the number of
vines of that variety only).


On each holding, for the peach and sour cherry surveys, one orchard will be 
selected with probability proportional to size. For the grape survey one 
vineyard will be selected independently for each of the three varieties 
actually grown on that holding, again with probability proportional to size. 
It cannot be overemphasized that these procedures must be followed faithfully
in order not to jeopardize the validity of the estimates. Selection procedures
should be monitored carefully to ensure there is no bias in the selection
towards small orchards or single variety vineyards, which admittedly would be
easier to enumerate.


To avoid an overlap of orchards/vineyards on farms that are selected in both 
replicates or more than once in the same replicate, all orchards/vineyards for 
a given holding are to be selected at the same time, using a pps systematic 
sampling method. The assignment to replicates is to be performed at random
after this selection. (Note that for the grape survey, on farms which appear
in both replicates, two vineyards of each variety grown are to be selected).

2.5 Third Stage Design 

Once an orchard/vineyard is selected, its current count of producing
trees/vines is determined and a simple random sample of four producing
trees/five producing vines is selected without replacement. The trees/vines
are marked for future identification since the same units are enumerated from
year to year. (In subsequent years, if a sampled tree/vine has been
destroyed, pulled up or has died, a replacement tree/vine is selected and
enumerated. However, it does not contribute to the estimate until its second
year in the sample). Also, each year the producing tree/vine count of selected
orchards/vineyards is reestablished in order that the industry's growth
(decline) can be monitored. (Note that in the grape survey this implies that
only vines of the particular variety sampled are to be counted in each
vineyard).

2.6 Fourth Stage Design 

This stage exists only for the sour cherry and grape surveys. 

2.6.1 The Sour Cherry Survey 


It is operationally impossible to count all sour cherries on a selected tree.
To estimate the total marketable fruit count, a sample limb (or limbs) is
selected with probability proportional to the cross-sectional area of the
limb. A method of selecting a limb in this way is described by Jessen. It
consists of selecting a limb at the initial (or primary) branching point of
the trunk with probability proportional to the cross-sectional area and
following the selected limb to the next branching point. This is repeated
until the cross-sectional area of a subsequently selected limb is within five
to fifteen percent of the primary limbs' cumulative cross-sectional area
total. As it is not always possible to select one such limb, in some instances
two limbs will have to be enumerated. The selected limb(s) on each sample tree
are then marked for future identification since the same limbs are enumerated
from year to year.
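
The limb selection can be sketched as a random walk down the tree, multiplying the selection probabilities at each branching point. This is only an illustration under stated assumptions: the nested-dictionary tree layout, the field names `csa` and `branches`, and the simplified stopping rule (stop once the limb's area is at most 15 percent of the primary limbs' total, or when no further branching exists) are all hypothetical, and the two-limb case is not covered.

```python
import random


def select_limb(primary_limbs, upper=0.15, rng=None):
    """Walk down a tree, choosing a limb at each branching point with
    probability proportional to its cross-sectional area (csa).

    Returns the terminal limb and its overall selection probability.
    """
    rng = rng or random.Random()
    primary_total = sum(limb["csa"] for limb in primary_limbs)
    candidates = primary_limbs
    probability = 1.0
    while True:
        weights = [limb["csa"] for limb in candidates]
        limb = rng.choices(candidates, weights=weights, k=1)[0]
        probability *= limb["csa"] / sum(weights)
        small_enough = limb["csa"] <= upper * primary_total
        if small_enough or not limb.get("branches"):
            return limb, probability
        candidates = limb["branches"]


# Hypothetical tree: two primary limbs, one of which branches again.
tree = [
    {"name": "A", "csa": 60, "branches": [
        {"name": "A1", "csa": 10, "branches": []},
        {"name": "A2", "csa": 30, "branches": []},
    ]},
    {"name": "B", "csa": 40, "branches": []},
]
limb, p = select_limb(tree, rng=random.Random(0))
```

The returned probability is what the sour cherry estimate in Section 5.1 divides by, so it has to be recorded at the time the limb is selected.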
2.6.2 The Grape Survey 


As for sour cherries, it is equally impossible to count all marketable berries
on a sample vine. Thus to estimate this total, the number of bunches of
grapes (i.e. those clusters containing more than five berries) is counted and
5 bunches are selected at random without replacement in order to be
enumerated. As with the other surveys, the vines are marked and are to be
visited the following year.


3. DATA COLLECTION 


The actual enumeration will be performed roughly four weeks before harvest
each year. It is of great importance that the selected sample vines, trees
and limbs as well as the orchards and vineyards be well identified in order to
enable the enumerators to complete their job in the short time available. The
enumerators will be required to count all marketable fruit (i.e. excluding
culls, which are immature or damaged fruit that will not be harvested) on the
sample peach trees and the selected cherry limbs. The fruit on the entire
peach tree is counted primarily because the fruit tends to be distributed
much more unevenly on a peach tree than on a sour cherry tree, precluding the
possibility of merely enumerating sample limbs.


For the grape survey, all berries on the five selected bunches are to be coun- 
ted, excluding culls. Since most bunches are very tightly packed, this will, 
in most cases, involve picking the fruit. For this reason and due to time
constraints, it is impossible to enumerate the entire sample vine.


4. REPLACEMENTS 


Even though every attempt will be made to return to the same trees, limbs or
vines in the following years, there arise cases when this is impossible. (For
example, branches are sawn off, trees or vines are pulled up or are otherwise
destroyed). If a sour cherry tree limb was sawn off, an attempt will be made
to select another limb on the same tree using the same procedures as before.
In the event that this is not possible, then just as in the case of peach
trees and grape vines, a new tree/vine will be selected at random in the same
orchard/vineyard. In the case that the whole orchard/vineyard has been
destroyed, a new orchard/vineyard will be selected on the same holding using
the same procedures as described in Section 2.4. In all these cases, the newly
selected sample limbs, trees, vines, orchards or vineyards will be enumerated;
however, they will not contribute to the estimate until the following year's
data are collected, as only matched observations are considered.


For those, hopefully rare, cases where the farmer has ceased to grow the fruit
of interest entirely or where the initial contact resulted in a refusal, a
third "replicate" of much smaller size was selected without replacement for
each of the surveys. The procedures for selecting the orchard/vineyard and
the sample of trees, limbs and vines for each replacement farm are the same as
those described above. The limbs, trees and vines will be enumerated every
year but will contribute to the estimate only when it is necessary to rotate
one of them into the sample. The sizes of the replacement sample are
indicated in Table 3 by region.


Table 3: Sample Sizes of the Replacement Sample
for the Tender Fruit Survey by Region

|             | REGION 1 | REGION 2 | REGION 3 | REGION 4 | TOTAL |
| PEACH       |     1    |     2    |     1    |     3    |    7  |
| SOUR CHERRY | [row illegible in source]                          |
| GRAPE       |     1    |     5    |     1    | [illegible] | [illegible] |

5. ESTIMATION FORMULAE


5.1 Estimates of Fruit Count per Tree/Vine


Denote by $y_\tau$ the total number of marketable fruit on a tree (vine)
$\tau$. Then for the peach survey, $y_\tau$ is estimated by $\hat{y}_\tau$,
the total number of marketable peaches counted on a sample tree $\tau$. For
the sour cherry survey, $y_\tau$ is estimated by

\[ \hat{y}_\tau = y_{\tau\ell} / p_\ell \tag{5.1.1} \]

where $y_{\tau\ell}$ is the total number of marketable sour cherries counted
on the sample limb(s) $\ell$ of the selected sample tree $\tau$;

and $p_\ell$ is the probability of selecting the sample limb(s) $\ell$.
Finally, for the grape survey, $y_\tau$ is estimated by

\[ \hat{y}_\tau = \frac{N_\tau}{n_\tau} \sum_{\ell=1}^{n_\tau} y_{\tau\ell} \tag{5.1.2} \]

where $N_\tau$ is the total number of bunches of grapes on the sample vine
$\tau$;

$n_\tau$ is the number of bunches of grapes that were enumerated on the
sample vine $\tau$ (typically $n_\tau = 5$);

and $y_{\tau\ell}$ is the number of berries on a bunch $\ell$ of the sample
vine $\tau$.

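
The three per-tree estimators of this section can be written out as a short sketch. The function names and the numbers in the usage lines are illustrative assumptions only; the arithmetic follows (5.1.1) and (5.1.2).

```python
def peach_tree_estimate(peach_count):
    """Peach survey: every marketable peach on the tree is counted."""
    return float(peach_count)


def cherry_tree_estimate(limb_count, limb_probability):
    """Sour cherry survey, formula (5.1.1): limb count divided by the
    probability with which the limb(s) were selected."""
    return limb_count / limb_probability


def grape_vine_estimate(total_bunches, berries_per_sampled_bunch):
    """Grape survey, formula (5.1.2): mean berries per sampled bunch,
    expanded by the total number of bunches on the vine."""
    n = len(berries_per_sampled_bunch)
    return total_bunches / n * sum(berries_per_sampled_bunch)


# Hypothetical field counts.
cherries = cherry_tree_estimate(limb_count=120, limb_probability=0.15)
berries = grape_vine_estimate(40, [50, 60, 55, 45, 40])
```

In this example a limb carrying 120 cherries that was selected with probability 0.15 represents about 800 cherries on the tree, and five bunches averaging 50 berries on a vine with 40 bunches give 2,000 berries.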
5.2 Regional Estimates of Fruit Count by Replicate


Denote by $\hat{Y}_{ar}$ the estimated total number of marketable fruit in
replicate r of region (area) a in the current year. For the grape survey,

\[ \hat{Y}_{ar} = \sum_{v=1}^{3} \hat{Y}_{arv} \tag{5.2.1} \]

where $\hat{Y}_{arv}$ is the estimated total grape count of variety v in
replicate r of region a.


For the purpose of uniformity of the following formulae, for the sour cherry
and peach surveys $\hat{Y}_{ar}$ and $\hat{Y}_{arv}$ can be used
interchangeably, since there is only one variety of sour cherries and peaches
to be estimated. (In other words, the subscript v can be ignored for the sour
cherry and peach surveys). Then $\hat{Y}_{arv}$ can be estimated by

\[ \hat{Y}_{arv} = \frac{1}{n'_{arv}} \sum_{f=1}^{n_{arv}} \sum_{b=1}^{n_{arvf}}
   \frac{N^{81}_{a}}{N^{81}_{arf}} \,
   \frac{N^{83}_{arvf}}{N^{83}_{arvfb}} \,
   \frac{N^{C}_{arvfb}}{n_{arvfb}}
   \sum_{t=1}^{n_{arvfb}} \hat{y}_{arvfbt} \tag{5.2.2} \]

with

$\hat{y}_{arvfbt}$ = the current year's estimated total number of marketable
fruit on the tree/vine t (variety v) in orchard/vineyard b, on farm f, in
replicate r, of region a (i.e. $\hat{y}_\tau$);

$n_{arvfb}$ = the number of trees/vines (of variety v) sampled in the current
year in orchard/vineyard b, of farm f, in replicate r, of region a (typically
$n_{arvfb} = 4$ for the sour cherry and peach surveys and 5 for the grape
survey);

$n_{arvf}$ = the number of orchards/vineyards (sampled for variety v in the
current year) on farm f, in replicate r of region a (typically $n_{arvf} = 1$
except for duplicates, i.e. large farms selected more than once in the same
replicate);

$n_{arv}$ = the total number of distinct farms (on which variety v was sampled
in the current year) in replicate r, of region a;

$n'_{arv}$ = the total number of orchards/vineyards (sampled for variety v in
the current year) in replicate r of region a (i.e.
$n'_{arv} = \sum_{f=1}^{n_{arv}} n_{arvf}$);

$N^{C}_{arvfb}$ = the total current count of producing trees/vines (of variety
v) in orchard/vineyard b, on farm f, in replicate r, of region a;

$N^{83}_{arvfb}$ = the total count of trees/vines (of variety v), in
orchard/vineyard b, on farm f, in replicate r of region a as of the 1982/1983
mapping operation;

$N^{83}_{arvf}$ = the total count of trees/vines (of variety v), on farm f, in
replicate r, of region a as of the 1983 listing operation;

$N^{81}_{arf}$ = the 1981 Census count of total trees/vines (all varieties) on
farm f, in replicate r, of region a (supplied with the sample listing);

and $N^{81}_{a}$ = the 1981 Census count of all trees/vines in region a (as
per Table 4).


Table 4: 1981 Census Counts of Trees/Vines ($N^{81}_{a}$)
for the Tender Fruit Surveys by Region

|             |  REGION 1 |  REGION 2 | REGION 3 |  REGION 4 |    TOTAL   |
| PEACH       |    13,094 |   389,157 |   10,271 |   411,697 |    824,219 |
| SOUR CHERRY |     9,496 |    50,888 |   55,449 |    32,536 |    148,369 |
| GRAPE       | 1,142,067 | 5,660,008 |  946,509 | 3,975,202 | 11,729,786 |

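
As a rough illustration of how the replicate estimator (5.2.2) combines the three stages of weighting, the following sketch computes a replicate total from nested sample records. It assumes the usual three-stage expansion (Census count of the region over Census count of the farm, listing count of the farm over listing count of the orchard, current orchard count over the number of sampled trees), averaged over the sampled orchards/vineyards. The dictionary layout and key names are hypothetical.

```python
def replicate_total(farms, census_region_total):
    """Estimated regional fruit total from one replicate.

    farms: list of dicts with keys
        "census_count"  - 1981 Census tree/vine count of the farm
        "listing_count" - 1983 listing count of the farm (this variety)
        "orchards"      - list of dicts with keys
            "listing_count"  - 1982/1983 count of the orchard/vineyard
            "current_count"  - current producing tree/vine count
            "tree_estimates" - per-tree/vine fruit estimates
    """
    n_orchards = sum(len(farm["orchards"]) for farm in farms)
    total = 0.0
    for farm in farms:
        first_stage = census_region_total / farm["census_count"]
        for orchard in farm["orchards"]:
            second_stage = farm["listing_count"] / orchard["listing_count"]
            third_stage = orchard["current_count"] / len(orchard["tree_estimates"])
            total += (first_stage * second_stage * third_stage
                      * sum(orchard["tree_estimates"]))
    return total / n_orchards


# Hypothetical replicate with a single farm and a single orchard.
sample = [{
    "census_count": 100,
    "listing_count": 100,
    "orchards": [{
        "listing_count": 50,
        "current_count": 50,
        "tree_estimates": [10, 10, 10, 10],
    }],
}]
estimate = replicate_total(sample, census_region_total=1000)
```

With ten fruit on every sampled tree and 1,000 trees in the region, the sketch returns 10,000, which is the value a self-weighting design would be expected to give when the Census, listing and current counts agree.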
5.3 Regional Estimates of Ratio of Change and their Precision


The ratio of change in production in region a from the previous year, denoted
by $R_a$, is estimated by

\[ \hat{R}_a = \hat{Y}_a / \hat{X}_a \tag{5.3.1} \]

where $\hat{Y}_a$, the estimated total marketable fruit count (of peaches,
sour cherries, grapes or grapes of variety v) in region a in the current
year, is given by

\[ \hat{Y}_a = \frac{1}{2} \sum_{r=1}^{2} \hat{Y}_{ar} \tag{5.3.2} \]

in the case of peaches, sour cherries, and total grapes of all varieties, and
by

\[ \hat{Y}_a = \hat{Y}_{av} = \frac{1}{2} \sum_{r=1}^{2} \hat{Y}_{arv} \tag{5.3.3} \]

in the case of grapes by variety, and where $\hat{X}_a$, $\hat{X}_{ar}$ and
$\hat{X}_{arv}$ are the corresponding previous year's estimates.

(Note that the subscript v can now be dropped as all estimates are treated in
the same manner, be it peach, sour cherry, total grape or grapes by variety
estimates.)

Define the variances of $\hat{Y}_a$ and $\hat{X}_a$ and their covariance by

\[ V(\hat{Y}_a) = (\hat{Y}_{a1} - \hat{Y}_{a2})^2 / 4 = D_{ya}^2 / 4 \tag{5.3.4} \]
\[ V(\hat{X}_a) = (\hat{X}_{a1} - \hat{X}_{a2})^2 / 4 = D_{xa}^2 / 4 \tag{5.3.5} \]
\[ \mathrm{Cov}(\hat{Y}_a, \hat{X}_a) = (\hat{Y}_{a1} - \hat{Y}_{a2})(\hat{X}_{a1} - \hat{X}_{a2}) / 4 = D_{ya} D_{xa} / 4 \tag{5.3.6} \]

where the numeric subscripts refer to the replicate number. Then the
variance, $V(\hat{R}_a)$, of the ratio of change estimate, $\hat{R}_a$, can be
estimated by

\[ \hat{V}(\hat{R}_a) = \frac{1}{\hat{X}_a^2} \left\{ V(\hat{Y}_a) - 2 \hat{R}_a \mathrm{Cov}(\hat{Y}_a, \hat{X}_a) + \hat{R}_a^2 V(\hat{X}_a) \right\}
   = \left\{ \frac{D_{ya}}{S_{xa}} - \frac{S_{ya} D_{xa}}{S_{xa}^2} \right\}^2 \tag{5.3.7} \]

with

\[ S_{ya} = \hat{Y}_{a1} + \hat{Y}_{a2}, \quad D_{ya} = \hat{Y}_{a1} - \hat{Y}_{a2}, \quad \text{etc.} \tag{5.3.8} \]

The coefficient of variation of $\hat{R}_a$ is then estimated by

\[ CV(\hat{R}_a) = \frac{\{\hat{V}(\hat{R}_a)\}^{1/2}}{\hat{R}_a} \times 100\% \tag{5.3.9} \]
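
A small numerical sketch may help to check the two-replicate variance formula. The function name and the replicate totals used below are hypothetical; the computation follows the sums and differences defined in (5.3.7) and (5.3.8).

```python
import math


def ratio_of_change(y1, y2, x1, x2):
    """Ratio of change, its variance estimate and CV (in percent) from the
    two replicate totals of the current (y) and previous (x) year."""
    s_y, d_y = y1 + y2, y1 - y2
    s_x, d_x = x1 + x2, x1 - x2
    ratio = s_y / s_x                                   # equals Y_a / X_a
    variance = (d_y / s_x - s_y * d_x / s_x ** 2) ** 2  # formula (5.3.7)
    cv_percent = 100.0 * math.sqrt(variance) / ratio    # formula (5.3.9)
    return ratio, variance, cv_percent


# Hypothetical replicate totals.
ratio, variance, cv = ratio_of_change(y1=110.0, y2=90.0, x1=100.0, x2=100.0)
```

Here the two current-year replicates differ by 20 while the previous-year replicates agree, so the ratio is 1.0 with an estimated variance of 0.01 and a coefficient of variation of 10%. When both years move together across replicates (for example y = (120, 80) and x = (120, 80)), the estimated variance is zero, which shows how the covariance term rewards matched samples.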


5.4 Regional Estimates of Total Fruit Production and their Precision 


Denoting by $X^T_a$ the previous year's actual yield (tonnage) in region a and
by $\hat{Y}^T_a$ the corresponding current year estimate, then $\hat{Y}^T_a$
is given by

\[ \hat{Y}^T_a = X^T_a \hat{R}_a \tag{5.4.1} \]

with its coefficient of variation estimated by

\[ CV(\hat{Y}^T_a) = \frac{X^T_a \{\hat{V}(\hat{R}_a)\}^{1/2}}{X^T_a \hat{R}_a} \times 100\% = CV(\hat{R}_a) \tag{5.4.2} \]


5.5 Estimates of Total Fruit Production and their Precision 


Denote by $\hat{Y}^T$ the estimated total fruit production over all four
regions in the current year. Then $\hat{Y}^T$ is given by

\[ \hat{Y}^T = \sum_{a=1}^{4} \hat{Y}^T_a \tag{5.5.1} \]

with a coefficient of variation, $CV(\hat{Y}^T)$, estimated by equation
(5.5.2) [equation illegible in source].

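
Since formula (5.5.2) cannot be read in the source, the following is only a plausible sketch and not the authors' formula. It assumes that the four regional estimates are independent, so that their variances add, and that the previous year's tonnages are known constants. All names and numbers are hypothetical.

```python
import math


def total_production(regions):
    """Total production and an assumed CV over regions.

    regions: list of (previous_tonnage, ratio_of_change, variance_of_ratio).
    Assumption (not from the source): regional estimates are independent.
    """
    total = sum(x * r for x, r, _ in regions)
    variance = sum(x ** 2 * v for x, _, v in regions)
    cv_percent = 100.0 * math.sqrt(variance) / total
    return total, cv_percent


# Hypothetical inputs for two regions.
total, cv = total_production([(100.0, 1.0, 0.01), (200.0, 1.5, 0.0)])
```

In this example the total is 400 tonnes with a coefficient of variation of 2.5%, all of it contributed by the first region.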


6. SUMMARY 


The first enumeration will take place in 1983; however, it will not be until
the summer of 1984 that the first estimates from the redesigned survey will be
produced. For this reason, it will be necessary to conduct both the old and
the new surveys in 1983 so that estimates will be available for that year.


Even though the survey was designed to be self-weighting, it is only approxi- 
mately so due to the time differences of the 1981 Censuses, the 1983 initial 
listing operation and the subsequent enumerations. The estimation formulae 
presented in the previous section take these time differences into account. 
However, due to the appealing simplicity of the self-weighting estimate, an 
investigation of its performance has been proposed once the data become
available.

A TIMELY AND ACCURATE POTATO ACREAGE ESTIMATE FROM LANDSAT:
RESULTS OF A DEMONSTRATION¹


R.A. Ryerson, J.-L. Tambay, R.J. Brown
and
L.A. Murphy, B. McLaughlin²


This paper describes the procedures used and results of a joint
Canada Centre for Remote Sensing (CCRS) and Statistics Canada pro- 
ject to provide a timely potato acreage estimate for New 
Brunswick, a major potato producing province in Canada. The pro- 
ject has demonstrated that satellite imagery combined with more 
traditional potato area estimation procedures can lower respondent 
burden, produce timely crop distribution maps and produce reliable 
estimates for subregions. 


1. INTRODUCTION 


Earlier satellite remote sensing work in the St. John Valley, New Brunswick by
the Canada Centre for Remote Sensing (CCRS) and the New Brunswick Department
of Agriculture has shown that an accurate and low cost estimate of potato
crop area could be made using satellite data (Mosher et al. 1978; Ryerson et
al. 1979; Ryerson et al. 1980). Interest in this and other CCRS work on
rapeseed-canola (Brown et al. 1980) resulted in Statistics Canada initiating a
real-time demonstration using data from the Landsat satellite in the 1980 crop
year. Statistics Canada, the Federal agency responsible for crop data
collection, wanted to compare traditional and satellite-derived estimates of
crop area in the same region. Potatoes were selected as the subject crop, and
the St. John Valley was the region.


¹ Originally presented at the Fifteenth International Symposium on Remote
  Sensing of Environment, Ann Arbor, MI, May 1981.


² R.A. Ryerson and R.J. Brown, Canada Centre for Remote Sensing (CCRS),
  E.M.R.; J.-L. Tambay, Business Survey Methods Division (this work was done
  while the author was in the Institutional and Agriculture Survey Methods
  Division), Statistics Canada; L.A. Murphy, Agriculture Statistics Division,
  Statistics Canada; and B. McLaughlin, Agriculture Statistics Division,
  Regional Office at Truro, Nova Scotia, Statistics Canada.

The particular benefits of satellite remote sensing which are of interest to
Statistics Canada are improving the accuracy of estimates obtained from their
regular surveys, possibly at more local levels; lowering respondent burden by
reducing the number and/or size of questionnaires; and the possibility of
providing maps of small areas containing speciality crops to better plan
sampling methods.


Following a summary of the main results in the next section, the balance of 
this paper outlines the remote sensing methodology used in this project, and 
describes the existing Statistics Canada data collection system, the project 
region, the ground data sample and data collection, the analysis of remotely
sensed data and the verification and analysis of results.


2. MAIN RESULTS 


Data collected by satellite were used to produce estimates of the potato area 
in the St. John Valley region of New Brunswick. These estimates, expanded to 
the provincial level, were within two percentage points of the Statistics 
Canada published estimate of 52,000 acres. The published estimate was based
on the results of three independent surveys in the province.


The analysis of satellite data was done in real-time (almost instantaneously) 
at CCRS, as much of the work could be initiated prior to data acquisition. 
The Agriculture Enumerative Survey (A.E.S.) area sample provided the ground 
data needed to calibrate the system, and was used to obtain ratio and regres- 
sion estimates which corrected for biases in the satellite classification of 
potato fields. Although the demonstration was not carried out in a production
environment, the final estimates could have been produced less than two weeks
after the satellite pass over the test region.


Problems in the satellite classification were caused by the presence of clouds
(satellite nonresponse), and by the confusion of potatoes with "similar
appearing" features on the analysis system. The first problem caused loss of
data, and required some imputation. The second problem was partly resolved by
adjustments to the classification, and by the use of ratio and regression
estimators.


A comparison of interviewer-collected ground data with data collected
through aerial photographs of sample fields showed that some fields were
missed by the A.E.S. interviewers, as these were not required for A.E.S.
purposes. This resulted in the aerial photography data being used instead of
A.E.S. data for the 1980 satellite estimates. The 1981 A.E.S. enumeration
procedures were changed to accommodate both A.E.S. and remote sensing
requirements.


As a result of the success of this demonstration, the experiment was
repeated in 1981. In addition, a similar experiment was undertaken that year
to estimate the canola acreage in the Peace River District of Alberta and
British Columbia.


3. REMOTE SENSING USING THE LANDSAT SATELLITE 


Remote sensing is the sensing or measuring of the characteristics of an object
from a distance, usually from an aircraft or satellite. When satellite data
are used, complete coverage of large areas can be provided quickly at a
relatively low cost. Possible areas of application include agriculture,
forestry, land utilisation, ice formations and general map making.


The United States National Aeronautics and Space Administration Landsat
satellites provided the satellite coverage for this and an earlier experiment
in New Brunswick. Each Landsat satellite orbits the earth 14 times a day in a
Sun-synchronous orbit (permitting coverage of the earth to be done at the same
local sun time). Light reflected from the ground is recorded on four narrow
bands of the spectrum using a Multispectral Scanner (MSS). The data are
transmitted in Canada to one of two receiving stations in Prince Albert,
Saskatchewan and in Shoe Cove, Newfoundland. A point on earth is covered once
every 18 days by a Landsat satellite (every nine days if two satellites are
used).


The CCRS Image Analysis System (CIAS) analyses the data, received on standard
products such as Computer Compatible Tapes (CCT's) covering areas of 25,600
square kilometers. The smallest units for which image data are defined are
called picture elements, or pixels. Each pixel carries its own spectral
signature, a measurement of its reflectance on the four spectral bands. The
spectral signature will depend on the features present in the pixel (roads,
crops, etc.), each of which carries its own signature. Crop signature is a
function of plant structure, type of soil background visible, crop maturity,
height, and leaf density, among other factors.


To obtain estimates of crop areas, it is necessary to identify each pixel
belonging to a crop of interest. Large known fields of the crop of interest
are located to train the system to identify the crop's signature. All pixels
are then classified as belonging to the crop or not, based on their spectral
signatures.


Areas for specific regions are obtained by counting the number of pixels
inside the regions that are classified as belonging to the crop. Additional
training may be done to cover pixels "missed" in the initial classification,
or to further separate confusion crops, that is, crops whose spectral
signature closely resembles that of the crop of interest.


Accurate ground data are needed, first, to locate large training fields for 
the crop of interest and second, to correct for any biases in the satellite 
classification. These data can be obtained by trained ground enumerators, or
by using airborne imagery, which is interpreted by image analysts.

4. CURRENT STATISTICAL DATA COLLECTION SYSTEM


Historically, Statistics Canada has used data obtained from annual mail
non-probability surveys as the primary input into its crop estimation system.
While these surveys are relatively inexpensive and can be completed quickly,
they are limited by varying response rates and possible non-representativeness
of respondents. Probability enumerative surveys were introduced in the
mid-1970's to overcome some of these problems. These involve enumeration of a
random sample of farmers by personal interviews. In 1980, Statistics Canada's
estimates of potato area in New Brunswick were based on the results of three
surveys: the Mail Survey, the Objective Potato Yield Survey (O.P.Y.S.), and
the Agriculture Enumerative Survey (A.E.S.).


The Mail Survey questionnaires are mailed out in early June to all farmers 
listed on a Farm Register maintained by the Agriculture Statistics Division. 
Replies are compiled on a county basis and county estimates are derived by 
linking annual changes in reported potato acreages to census potato acreages 
for the county. The county estimates are aggregated to give provincial
estimates by late June.


The O.P.Y.S. is a specialized mail and enumerative survey designed to estimate
potato area and yields in the potato growing region of New Brunswick. The
survey is conducted in mid-July on a random sample of potato farmers selected
from the Farm Register and potato area estimates are generated by mid-August.


The A.E.S. is a multi-purpose enumerative survey designed to estimate crop,
livestock and farm expense data at provincial levels. The A.E.S. is a
multiple frame survey consisting of a random list sample of farmers selected
from the Farm Register and a random area sample of segments. Enumerators visit
the sampled farms in late June and early July. Acreage estimates from the
survey are available in early August. Each year about 20% of the segments are
changed.

During the growing season, two potato crop area estimates are published. The
first, in late June, is based on the mail survey results. The second
estimate, in early September, is based on a review of the estimates from all
three surveys and discussion with provincial authorities. The date of the
second estimate was the target date for generation of a satellite-derived
estimate.


5. PROJECT REGION 


The area for which an estimate was required is located in the upper St. John 
Valley in New Brunswick. It starts in the south at Woodstock in Carleton 
County and follows the St. John River for 200 kilometers northwest through
Victoria County to Claire in Madawaska County.


The region is heavily wooded, of varied, rolling terrain. There are some
problems related to stoniness and drainage. Within the area are 70,000
hectares of improved cropland, of which about 20,000 are usually potatoes.
Other major crops are grains, hay and processing vegetables such as peas,
broccoli and brussels sprouts. Parcel sizes range from 0.1 hectare seed plots
to 40 hectare fields.


6. 1980 GROUND SAMPLE AND DATA COLLECTION 


The area sample for the A.E.S. in New Brunswick was considered a suitable
vehicle for obtaining ground data for interpreting remote sensing data. This
sample is selected in two stages. At the first stage, census Enumeration
Areas (EA's) which had farm headquarters in them in the 1976 census (called
Census Agricultural EA's) were stratified based on their potato acreages,
cattle, and pig numbers (1976 census data). Within each stratum, two
replicated simple random samples of EA's were selected. Each sampled EA was
segmented into identifiable area units of about 5 to 8 square kilometers using
maps, and a simple random sample of one in 10 segments was selected per EA.


A.E.S. enumerators working in the study region were supplied with old enlarged
aerial photos (scale 1" = 832') for each sampled segment. The photos were
obtained from provincial sources. Most of them had been taken in 1976. While
contacting the farmers operating land inside the segment, the enumerators
were required to show the photograph to the farmer, identify on it all potato
and corn³ fields and note their areas as reported by the farmer. Written
instructions on procedures to be followed were included in the interviewers'
manual, and interviewers and supervisors were trained on these procedures.

7. ANALYSIS OF REMOTELY SENSED DATA 
7.1 Previous Work 


Work in the same region using 1975 data has been reported elsewhere (Mosher et 
al. 1978), and a detailed description of the approach is available (Ryerson et 
al. 1980). In the 1975 work, a test area which contained about twenty percent 
of the province's potato crop was selected for detailed analysis from the 
potato growing region. This was supported by ground data collection for the
entire test area.


The 125 square kilometer test area and two sub-areas were located on the 
colour video display screen of the CCRS Image Analysis System (CIAS) 
(Goodenough, 1979). A very simple supervised training scheme was used to 
gather the statistics of pixels in three contiguous potato fields in the form 
of four one-dimensional histograms. A four-dimensional parallelepiped in
feature space was generated as defined by the limits of each of the histograms
to serve as a decision boundary. All points within the parallelepiped were
classified as potatoes, and those outside the region as "other".
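
The parallelepiped decision rule can be sketched in a few lines. This is an illustrative reading of the description above, not the CIAS implementation: the function names are hypothetical, each box is taken as the per-band minimum and maximum of the training pixels, and the 50 m pixel used to convert counts to hectares is an assumption borrowed from the 1980 data described in Section 7.3.

```python
def train_parallelepiped(training_pixels):
    """Return per-band (low, high) limits from training pixels.

    Each pixel is a tuple of four band values; the limits play the role of
    the ends of the four one-dimensional histograms.
    """
    bands = zip(*training_pixels)
    return [(min(values), max(values)) for values in bands]


def is_potato(pixel, boxes):
    """A pixel is classified as potatoes if it falls inside any box."""
    return any(
        all(low <= value <= high for value, (low, high) in zip(pixel, box))
        for box in boxes
    )


def crop_area_hectares(pixels, boxes, pixel_size_m=50.0):
    """Area estimate: number of classified pixels times the pixel area."""
    count = sum(1 for pixel in pixels if is_potato(pixel, boxes))
    return count * pixel_size_m ** 2 / 10_000.0


# Hypothetical training pixels and scene.
box = train_parallelepiped([(10, 20, 30, 40), (12, 22, 28, 44)])
scene = [(11, 21, 29, 42), (13, 21, 29, 42), (10, 20, 30, 40)]
area = crop_area_hectares(scene, [box])
```

Passing a list of boxes allows a second parallelepiped, such as one trained on boundary pixels, to be added without changing the rule.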


One of the major problems was the proper classification of boundary pixels.
These present special problems, as they fall on the border of two different
fields. Their reflectance is a function of the amount of each field within
the pixel and the reflectance of each cover material in the two fields. To
attempt to compute the percentage of each cover type present in such pixels is
generally very complicated. However, it was found that by modifying the
original decision boundary through adding a second parallelepiped formed by
training on a number of boundary pixels which appeared to be in potato fields,
reasonable area estimates could be achieved. Selection of the appropriate
boundary pixels to be classed as potatoes was on the basis of subjective
visual interpretation of the display (data from three of the four spectral
bands were merged to form a colour display, with colours simulating those of a
colour infrared film).


³ Corn fields were also required to be identified since earlier remote
  sensing work indicated that corn was a confusion crop for potatoes (Ryerson
  et al. 1980).


Less than four hours of CIAS time were required to perform the area estimate
for the entire potato belt. Location, display and analysis of the primary and
sub-test areas required just over one hour. Selection, display and analysis
of the subsequent five subscenes required two and one-half hours, while
location of the New Brunswick border and elimination of data from outside the
province required another hour.


Compared to total area of potato fields interpreted and measured from low al- 
titude aerial photographs taken at the same time, the 1975 satellite estimates 
were 95% accurate (i.e., 95% of the estimated true value), in the sub-area 
containing the training site, 80% accurate in the second sub-area, and 88% ac- 
curate over the whole primary test area. On repeated tests using different 
training fields, the accuracy over the primary test area ranged from 85% to 
97%. The province-wide accuracy was 84.5%. Some of the error in the
provincial estimate resulted from the fact that some potatoes are grown
outside the potato belt. Other factors contributing to the error are
discussed below.


7.2 PROCEDURES TO IMPROVE ESTIMATES 


Although the previous work was successful, potential sources of error were
identified for applications requiring accuracies greater than, say, 85%. The
major problems arise in the subjective selection of the potato field boundary
pixels, in the handling of small fields, and in the resolution of difficulties
with the crops confused with potatoes.


With regard to confusion crops, ideally one should know the spectral
reflectances of potatoes and all of their confusion crops throughout the
growing season. With such information, it is possible to specify the
phenological window
during which potatoes can be reliably separated from other crops. Unfortuna- 
tely, such a data set does not exist, although knowledge of the region's crops 
and the cultural practices does provide some general guidelines. In this 
case, based on the field experience of the authors, it was hypothesized that 
the optimum date for separation of potatoes from other crops in this region 
would be from mid-July to mid-August. To test this hypothesis and provide an 
indication of the degree of separability of potatoes from other crops, an ana- 
lysis was performed of a Landsat MSS Computer Compatible Tape (CCT) acquired 
over the St. John Valley on August 8, 1975. Figure 1 shows the Landsat band 
radiance values for potatoes, corn, peas, hay, broccoli, pasture, buckwheat, 
bare soil and grains. It can be seen from this that potatoes are easily sepa- 
rable from the other crops except for the peas, which are usually harvested by 
mid to late August. It would therefore appear that the analysis of data col- 
lected late in the growing season is likely to lead to the separation and 


identification of potatoes. 


The problem of small fields and boundary pixels can be handled by an approach
which uses available ground information over a limited area to produce a more
accurate crop area estimate over an extended area. Given the size of the
potato growing region here, the area of potatoes within each of ten to fifteen
segments is required. Each segment is of the order of five to eight square
kilometers in size. The whole area is still classified as well as possible,
but the subjective boundary pixel class is not produced. The classification
result for each segment is then used, along with the available ground
information (the aerial photograph data in 1980, A.E.S. data for future
years) to obtain a regression relationship which is applied to the entire area
estimate to produce a revised estimate (Hanuschak et al. 1979). A ratio
estimate, based on the total area estimates obtained from satellite and aerial
photograph data, is also produced.


7.3 GENERATION OF THE 1980 ESTIMATE USING LANDSAT 


The generation of a satellite-based potato area estimate can be described as a
three-part process: ordering of old and new data, pre-location of A.E.S.
segment boundaries, and image analysis.


The Landsat data ordered were Digital Image Correction (DICS) Computer
Compatible Tapes (CCT's) using the sin x / x interpolation for geometric
correction (Guertin et al. 1979) and Cal 3 radiometric correction (Ahern and
Murphy, 1978). Each CCT covers four 1:50,000 National Topographic System
(NTS) map sheets with a resampled square pixel of 50 m. Four CCT's are
required to cover the region. Existing data were ordered for delivery in May
of 1980, while new data were ordered for the appropriate satellite passes from
mid-July to mid-August. The ordering process proceeded smoothly for existing
data, but was complicated for the real-time data by the failure of
Landsat-III. A Landsat-II pass on August 17 was used to create DICS CCT's
which were delivered to the analysis facility on August 22, well ahead of
schedule.


The pre-location of A.E.S. segments was done in the spring of 1980 using the
polygon cursor option on the CCRS Image Analysis System (Goodenough, 1977).
A.E.S. segment boundaries were provided by Statistics Canada on 1:50,000 map
sheets and on photocopies of the airphoto enlargements given to the A.E.S.
enumerators. Although some segments had boundaries which were easy to locate
(streams, forested edges, lakes, etc.), others based on political or census
boundaries proved very complex. Segments whose boundaries were a combination
of major roads and/or major rivers could be located, bounded and stored after
less than five minutes' work on the enlarged colour display of 128 x 128 50 m
pixels on a 512 x 512 monitor. The most complex took up to an hour, with an
average of 20-25 minutes. Once located, the segment was stored by specific
DICS pixel coordinates so that it could be overlaid on new data as it arrived
to locate both training data and inputs for the estimator. Because of the
location of some segments on the boundary between two DICS CCT's and other
similar problems, a number of segment boundaries were not located in the
preparatory phase. Software is now being written to shorten the time required
for the entire project, especially the pre-location phase. Use of original
photographs in place of photocopies, planned for the 1981 project's new
segments (15 rotated in for 1981), should also shorten the time required.


Once the 1980 ground data were delivered to the analysis center, potential
training sites (based on field size) were selected. Several fields were
selected from one segment in the north of the region (near St. André) while
several others were selected from a segment in the south (near Hartland).


Upon receipt of the 1980 satellite data, it became a relatively simple task to
recall the segment boundaries, overlay them, locate training fields and begin
classification using methods described in 7.1. In addition to the selected
training areas, another group of large fields was selected, as were areas of
known potato fields which appeared brighter red on the monitor than those in
the training set. As classification results became available, crop areas were
recorded for each 512 x 512 pixel subscene and for each of seventeen A.E.S.
segments.


There were four problems encountered during the classification: one involving
imputation of crop under scattered cloud and cloud shadow and the other three
related to confusions. The method of imputation of potatoes under cloud was
quite straightforward. It was assumed that the percent area of crop under the
clouds was similar to the area of crop in an adjacent "like-appearing"
region. A simple formula was used to determine potato area under cloud, PC:

where: $P_r$ = potato area measured in the cloud free region; $C_r$ = area
in cloud and shadow in the region and $T_r$ = total area in the region.
These areas were incorporated into the total satellite estimate; no A.E.S.
segments under cloud were included in the ratio or regression analyses. The
problems with confusions were, for the most part, solved through careful
modification of the classification parameters. In one case, an unknown form
of widely scattered natural regrowth in forest cut-overs was confused with
potatoes. In another case, hay fields with regularly spaced piles of stone
were similar to potatoes. Only a few fields of clover in one segment and some
of the fairways of a golf course in another remained as confusions after
modifying the classification.
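
The imputation of potato area under cloud can be illustrated with a short
sketch. The formula itself does not survive in this copy, so the form used
below is an assumption: the potato share of the cloud-free part of the region,
P / (T - C), is applied to the clouded area C. The function name and the sample
figures are hypothetical.

```python
def potato_area_under_cloud(p_clear: float, c_cloud: float, t_total: float) -> float:
    """Impute potato area under cloud by proportional allocation.

    Assumed form (not taken from the paper): PC = P / (T - C) * C, i.e. the
    clouded area is given the same potato share as the cloud-free area.
    """
    clear_area = t_total - c_cloud
    if clear_area <= 0:
        raise ValueError("region has no cloud-free area to take a proportion from")
    return p_clear / clear_area * c_cloud


# Hypothetical figures: a 1,000-unit region with 200 units under cloud and
# 120 units of potatoes measured in the cloud-free part.
pc = potato_area_under_cloud(p_clear=120.0, c_cloud=200.0, t_total=1000.0)
```

With these figures the cloud-free potato share is 120/800 = 15%, so 15% of the
clouded area is imputed as potatoes.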


The areas calculated for the region are presented and discussed in more detail
below.


8. RESULTS AND ANALYSIS OF THE 1980 NEW BRUNSWICK POTATO PROJECT 
8.1 INTRODUCTION 


The analysis of data from the 1980 New Brunswick potato project was done at 
the regional, segment, and field levels. Two types of estimates were obtained 
for the total potato area of the test region in the St. John Valley using 
satellite data and ground verified measures from high altitude aerial photo- 
graph data (the aerial photography data were obtained and analysed by CCRS). 
The estimates and their variances were then compared to other estimates from 
Statistics Canada surveys in New Brunswick. Segment potato acreages reported 
by the A.E.S. and by satellite were then compared to the aerial photograph 
acreages (which are considered here to be closest to the actual values) to 
determine the strength of their relationship at the segment level. This
analysis was complemented by examining the variation in A.E.S. reporting of
field acreages for each interviewer's assigned area.


8.2 ESTIMATION OF POTATO AREAS USING SATELLITE DATA


A ratio and a regression estimate of the total potato area in the test region 
in New Brunswick were obtained using satellite data and the aerial photograph 
segment data. Estimates, along with their variances, were calculated using 
the A.E.S. sample design. The A.E.S. in New Brunswick is a multiple-frame 
stratified replicated two-stage sample of segments, designed to give accurate 
estimates of various items at the provincial level (see sections 4 and 6). 
Since the A.E.S. strata did not coincide with the test region boundaries, the
technique of post-stratification was used for estimation, treating the EA's as
a with-replacement sample. Segments with missing satellite data were not
included in the sample, nor was one outlier (see Figure 3). These estimates
were based on 40 segments. Finally, since A.E.S. enumerators did not always
collect data on all farms inside the segment (see 8.3), aerial photograph
segment acreages were used for estimation.


Let the label x represent the reported satellite potato data and y represent
aerial photograph data. The ratio and regression estimates of Y, the total
potato area in the test region, were calculated as:

$$\hat{Y}_{ratio} = \hat{R} X = \frac{\hat{Y}}{\hat{X}} X, \quad \text{and}$$

$$\hat{Y}_{reg} = \hat{Y} + B (X - \hat{X}),$$

where

$$\hat{Y} = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{i=1}^{n_h} \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} y_{hij}$$

is the design estimate of Y;

$\hat{X}$ is the design estimate of X, the total uncorrected satellite potato
area in the test region. $\hat{X}$ is obtained by substituting $x_{hij}$ for
$y_{hij}$ in the formula for $\hat{Y}$;

$y_{hij}$ and $x_{hij}$ are the observed values for the jth selected segment
in the ith selected first-stage unit (EA) from stratum h;

$N_h$ and $n_h$ are the total and sampled number of EA's in stratum h;

$M_{hi}$ and $m_{hi}$ are the total and sampled number of second-stage units
(segments) in the ith selected EA of stratum h;

$\hat{R}$ is an estimate of Y/X; and,

$B = cov(\hat{Y},\hat{X}) / var(\hat{X})$ is the linear regression
coefficient.

The variance estimates of $\hat{Y}_{ratio}$ and $\hat{Y}_{reg}$ are given by:

$$v(\hat{Y}_{ratio}) = v(\hat{Y}) - 2\hat{R}\, cov(\hat{Y},\hat{X}) + \hat{R}^2 v(\hat{X}), \quad \text{and}$$

$$v(\hat{Y}_{reg}) = v(\hat{Y}) - 2B\, cov(\hat{Y},\hat{X}) + B^2 v(\hat{X}),$$

where

$$v(\hat{Y}) = \sum_{h=1}^{L} \frac{1}{n_h(n_h-1)} \sum_{i=1}^{n_h} \left( N_h \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} y_{hij} - \frac{N_h}{n_h} \sum_{i=1}^{n_h} \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} y_{hij} \right)^2$$

and $v(\hat{X})$ is calculated by substituting $x_{hij}$ for $y_{hij}$ in
$v(\hat{Y})$.

It can be seen that $B = cov(\hat{Y},\hat{X}) / v(\hat{X})$ is the value of B
which minimizes $v(\hat{Y}_{reg})$.
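
As an illustration of how the two adjusted estimates and their variance
estimates fit together, here is a minimal sketch. It assumes the design-based
quantities (the design estimates of Y and X, their estimated variances and
covariance, and the known satellite total X) have already been computed; the
function names and the numbers are hypothetical and are not taken from the
1980 project.

```python
from typing import NamedTuple


class Adjusted(NamedTuple):
    estimate: float
    variance: float


def ratio_estimate(y_hat, x_hat, x_total, v_y, v_x, cov_yx) -> Adjusted:
    """Ratio estimate R*X with R = y_hat/x_hat, and its variance estimate."""
    r = y_hat / x_hat
    return Adjusted(r * x_total, v_y - 2.0 * r * cov_yx + r * r * v_x)


def regression_estimate(y_hat, x_hat, x_total, v_y, v_x, cov_yx) -> Adjusted:
    """Regression estimate y_hat + B*(X - x_hat) with B = cov/var."""
    b = cov_yx / v_x
    return Adjusted(y_hat + b * (x_total - x_hat),
                    v_y - 2.0 * b * cov_yx + b * b * v_x)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
ratio = ratio_estimate(100.0, 80.0, 90.0, 25.0, 16.0, 18.0)
reg = regression_estimate(100.0, 80.0, 90.0, 25.0, 16.0, 18.0)
```

Because B = cov/var minimizes the quadratic v(Y) - 2B cov + B^2 v(X), the
regression variance can never exceed the ratio variance computed from the same
inputs.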
Table 1 shows the ratio and regression estimates, along with other Statistics
Canada estimates of the potato area in New Brunswick. The ratio and
regression estimates, pro-rated to the provincial level, are both very close
to the Statistics Canada published figure of 52,000 acres. There is very
little difference between the two estimates, and they both have the same
coefficient of variation. To give an idea of the gain in efficiency obtained
by the ratio estimate, the variance of the ratio estimate was as low as
one-fifth of that of $\hat{Y}$, the design estimate of Y based on the
post-stratified design (area sample only). It may be noted that the C.V.'s of
the ratio and regression estimates are of the same order as that of the A.E.S.
multiple-frame estimate, although the latter is based on a larger sample size.


8.3 COMPARISON OF DATA AT THE SEGMENT AND FIELD LEVEL


Figures 2 and 3 show plots of segment potato acreages, as reported by the 
A.E.S. enumerators and by satellite, against aerial photograph acreages. Not 
all sample segments were used in the analysis. Satellite data were missing 
for 16 segments due to cloud cover and image location. Eight A.E.S. segments, 
containing survey non-respondents and very large farms whose field data were 
not to be collected by the enumerators, were not used. The two outliers in
the plots are not used in the calculations here and in Section 8.2.


The plots show a strong linear relationship between A.E.S. and aerial
photograph data, as well as between satellite and aerial photograph data, at
the segment level, with correlations¹ of .991 and .968 respectively². There
is a tendency for both the A.E.S. enumerators and the satellite classification
to underestimate the acreages. This is less marked for the A.E.S. Causes of
discrepancies of satellite acreages were explained in Section 7. Some
segments with little or no potatoes were over-estimated because of confusion
crops. (The satellite outlier had confused a large hay field with rocks for a
potato field; this error could have been removed by modifying the
classification.) One major cause of A.E.S. underestimation was that the
interviewers did not enumerate some farms in the segment because of the
multiple-frame procedures (these include farms which appeared on the list
frame as well as farms that had land in more than one sample segment).
Specific instructions have been written up in the 1981 A.E.S. field procedures
to ensure that all farms in the segment are enumerated for their potato
acreages next year when the test is to be repeated. This is expected to bring
the A.E.S. reported acreages closer to the aerial photograph acreages. The
strong relationship between the A.E.S. and the aerial photograph data for the
sampled segments is encouraging and supports making adjustments to the
satellite estimates using the data collected by the A.E.S. enumerators.

¹ The correlations were estimated using the sample design weights.

² A plot of satellite data against the A.E.S. data looked very similar to
Figure 3, with a correlation of .957.


Another cause of A.E.S. discrepancy with aerial photograph segment data was
the mis-reporting of field boundaries and field sizes. A.E.S. field acreages
were obtained by interviewing farm operators, and thus were frequently
reported in multiples of five acres. Plots of A.E.S. reported field acreages
against aerial photograph acreages by interviewer assignment areas indicate
that there may be a difference in field reporting between the assignment
areas. This could be caused by the interviewers themselves, but also by other
factors such as geographic location and structures of fields within sample
segments, or by variation in reporting errors. More variation was observed in
the region east of the St. John River, where average field sizes were
generally larger. The relationship between the A.E.S. data and aerial
photograph data was stronger at the segment level than at the field level.


9. SUMMARY AND CONCLUSION 


Satellite data were used in 1980 to generate a highly accurate potato area
estimate for the three major potato producing counties of New Brunswick.
Through the project described here, refinements have been identified for field
procedures and analysis methods which should provide even more accurate
estimates of crop area with satellite data, reduce respondent burden, and
provide detailed spatial information previously available only in Census
years.


TABLE 1

SURVEY ESTIMATES OF THE TOTAL POTATO AREA FOR THE TEST REGION
AND FOR THE PROVINCE OF NEW BRUNSWICK¹

Survey and/or          Test Region    New Brunswick    Coefficient
Estimate               Estimate       Estimate         of Variation (%)
                       (Acres)        (Acres)

Satellite
  Uncorrected          47,354         N/A              N/A
  Ratio                49,504         (illegible)      (illegible)
  Regression           49,115         (illegible)      (illegible)
Mail Survey            N/A            50,800           N/A
A.E.S.
  Multiple Frame       N/A            53,854           (illegible)
  (label illegible)    47,203         49,129           (illegible)
Statistics Canada
  Published (Sept. 5)  N/A            52,000           N/A

¹ The test region, composed of the counties of Carleton, Madawaska, and
Victoria, accounts for about 96.08% of the total potato area of New
Brunswick.


FIGURE 1 Comparison of Satellite Band Radiances for Various Crops.
(Plot of band radiance, in digital counts, against Landsat MSS band.)

Landsat MSS band radiance (August 8, 1975) for grains (1), buckwheat (2),
pasture (3), hay (4), corn (5), peas (6), potatoes (7), broccoli (8), bare
soil (9), weeds (10). The bands are numbered 4 to 7. The relative distances
between the digital counts in each column indicate how separable the crops are
for the given band.


FIGURE 2 (Plot of A.E.S. acreage against aerial photograph acreage.)

FIGURE 3 (Plot of satellite acreage against aerial photograph acreage.)

FIGURES 2 and 3 - Plots of reported A.E.S. and satellite potato acreages vs.
                  low altitude aerial photograph acreages for sampled segments
                  in New Brunswick. The circled observations represent
                  outliers.


SAMPLING ON TWO OCCASIONS WITH PPSWOR


G.H. Choudhry and Jack E. Graham¹


A theory of sampling on two occasions with unequal probabilities and 
without replacement is presented. Fellegi's (1963) method, which 
yields the same selection probabilities for a given unit on each 
occasion, is used to select the units for the rotation sample. The 
variances of composite estimators of the population total on the
second occasion are developed. Numerical results are presented for
small sample sizes and efficiency comparisons are made with a 
competing strategy. 


1. INTRODUCTION 


In surveys of a repetitive nature there are advantages to using a partial 
replacement sampling scheme both from the point of view of efficiency of 
estimation as well as reduction of respondent's burden. Essentially, after 
each sampling occasion a fraction of the units is rotated out of the sample 
and is replaced by a fresh subsample from the population. The literature 
abounds with discussions of sampling procedures and estimators when sampling 
on two or more occasions with equal probability. But of particular practical
importance is the situation where units are selected on a given occasion with
unequal probabilities. Thus, consider a finite population of N units
(1, 2, ..., N) and two sampling occasions 1 (the previous occasion) and 2 (the
current occasion). Let $y_{1i}$ and $y_{2i}$ denote the values of a
characteristic y borne by the i-th unit on occasions 1 and 2 and let $Y_1$
and $Y_2$ denote the respective population totals. A size measure $x_i$ is
known for each of the units in the population.


Raj (1965) considered the following pps (probabilities proportional to size)
sampling scheme: on the first occasion a sample s of size n is selected with
probabilities $p_i$ proportional to the $x_i$ values and with replacement
(wr). On the second occasion a simple random sample $s_1$ of m units is
selected from s without replacement (wor) and an independent pps sample $s_2$
of u = n-m is selected wr from the entire population. $Y_1$ and $Y_2$ are
then respectively estimated by

$$\hat{Y}_1 = \sum_{i \in s} y_{1i}/(n p_i) \quad \text{and} \quad \hat{Y}_{2R} = Q^* \hat{Y}_{2u} + (1-Q^*) \hat{Y}'_2,$$

where $\hat{Y}_{2u} = \sum_{i \in s_2} y_{2i}/(u p_i)$,
$\hat{Y}'_2 = \hat{Y}_1 + \sum_{i \in s_1} (y_{2i} - y_{1i})/(m p_i)$,
and Q* is a weight, 0 < Q* < 1.

The minimum variance of $\hat{Y}_{2R}$ was developed under the assumption that

$$V_{pps}(y_t) = \sum_{i=1}^{N} p_i \left( \frac{y_{ti}}{p_i} - Y_t \right)^2$$

is the same for t = 1 and 2.

¹ G.H. Choudhry, Census & Household Survey Methods Division, Statistics
Canada and Jack E. Graham, Carleton University.


The problem of sampling with ppswor on one occasion has attracted considerable
attention in the literature. A major difficulty lies in the specification of
feasible procedures which lead to specified probabilities at each and every
draw. Fellegi (1963) has proposed a method such that the probability that
unit i is selected on each of the n draws is $p_i$ by determining n-1 sets of
"working probabilities". This is an extremely desirable feature for rotating
samples where it is essential that the usual pps estimator be unbiased for
$Y_2$; this will not be true for any partial replacement design that does not
feature a constant $p_i$ for each of the n draws. The calculations inherent
in Fellegi's scheme have, until recently, been prohibitive for n > 2.
Choudhry (1981) has developed an iterative procedure for implementing
Fellegi's scheme and prepared a computer program to evaluate the working
probabilities when n < 5. Although the convergence is fast in terms of the
number of iterations, the amount of computation increases at the rate $N^n$.
The program also computes the joint probabilities for the inclusion of both
units i and j in the sample for variance calculation purposes.


Rao, Hartley and Cochran (1962) devised the "random group method" for
selecting a sample with ppswor. The population of N units is split into n
groups of sizes $N_1, N_2, \ldots, N_n$ where $\sum N_i = N$ and a sample of
one unit is drawn independently from each group with probabilities
proportional to the $p_i$'s. Ghangurde and Rao (1969) extended the random
group method to sampling on two occasions. For simplicity, the N units were
split into n groups each of size N/n (assumed to be an integer). On occasion
1, one unit is drawn from each random group as above, giving a sample s of n
units. On the second occasion a simple random sample $s_1$ of m matched
units is selected from the n units wor and an independent sample $s_2$ of
u = n-m units is drawn from the whole population by the method used in
obtaining s. They form a composite estimator of $Y_2$ and obtain its
minimum variance under an optimum choice of the weight Q. The optimum value
of $\lambda = m/n$ is then determined. The authors remarked that it would
likely be more efficient to select $s_2$ from the N-n units in the population
that are not included in s.
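
The one-occasion random group selection described above can be sketched as
follows. This is only an illustration of the idea (random split into n groups,
then one draw per group with probability proportional to size); the function
name, the round-robin way of forming groups and the size measures are
assumptions made for the example.

```python
import random


def random_group_sample(sizes, n, rng):
    """Select one unit from each of n random groups, proportional to size."""
    units = list(range(len(sizes)))
    rng.shuffle(units)                          # random split of the population
    groups = [units[g::n] for g in range(n)]    # n groups, sizes differ by <= 1
    sample = []
    for group in groups:
        weights = [sizes[i] for i in group]     # within-group pps draw
        sample.append(rng.choices(group, weights=weights, k=1)[0])
    return sample


sizes = [12, 5, 8, 20, 7, 9, 15, 4]             # hypothetical size measures x_i
sample = random_group_sample(sizes, n=4, rng=random.Random(1))
```

Because the groups are disjoint, the n selected units are always distinct,
which is what makes the scheme a without-replacement design.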


Chotai (1974) modified the Ghangurde-Rao (G-R) design on the second occasion;
the n units in s are split at random into m groups of size n/m (assumed to be
an integer). One unit is selected from each of the m groups with probability
proportional to $p_i$, yielding a sample $s_1$. A sample $s_2$ is obtained
as in the G-R method. The optimum variance of his composite estimator is
derived, the optimum $\lambda$ determined and relative efficiency comparisons
with respect to G-R's and Raj's optimal estimators are made. Chotai's
estimator was found to be always more efficient than one of these and, in many
cases, than the other as well. A brief discussion of the case when n/m is not
an integer is provided. It is worth noting that because $\lambda$ is not a
continuous function, the optimum $\lambda$ should really be determined using
integer programming methods. In what follows a sampling strategy is developed
which often results in greater gains in efficiency over ppswor sampling than
previously proposed schemes.


2. SAMPLING STRATEGY


2.1 SAMPLING PROCEDURE 


From the population of N units, (1,2,...,N), select a sample of n+u units,
u < n, draw by draw and without replacement using Fellegi's method such that
the probability of selecting the i-th unit at each draw is $p_i$,
i = 1, 2, ..., N, $\sum p_i = 1$. On the first of the two occasions, the
first n units are observed from the n+u selected; on the second occasion the
first u units are dropped from the sample and the unused set of u units is
rotated into the sample. Thus m = n-u units are observed on both occasions.
The n units observed on the first occasion are referred to as s, those units
observed on both occasions as $s_1$ (where $s_1 \subset s$) and the set of
unmatched units observed only on the second occasion as $s_2$. Note that
Fellegi's scheme guarantees that the selection probabilities for a given unit
i are the same on each draw and hence the same on both occasions. By
restricting his attention to a sub-class of non-homogeneous linear
model-design unbiased estimators, Chaudhuri (1980) has shown that the
foregoing sampling scheme yields an optimal strategy. This is a further
motivation for using Fellegi's method.
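
The bookkeeping of the rotation can be sketched in a few lines. The sketch
assumes the n+u units have already been drawn in order by some
without-replacement method (Fellegi's method itself is not implemented here);
the function name and the unit labels are hypothetical.

```python
def rotation_sets(draws, n, u):
    """Split n+u ordered, distinct draws into the sets used on each occasion."""
    if len(draws) != n + u or not 0 < u < n:
        raise ValueError("need exactly n+u draws with 0 < u < n")
    s = draws[:n]        # observed on the first occasion
    s1 = draws[u:n]      # matched units, observed on both occasions (m = n-u)
    s2 = draws[n:]       # unmatched units rotated in on the second occasion
    s_star = draws[u:]   # all n units observed on the second occasion
    return s, s1, s2, s_star


# Hypothetical ordered draws of n+u = 5 unit labels with n = 4, u = 1.
s, s1, s2, s_star = rotation_sets([7, 2, 9, 4, 5], n=4, u=1)
```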


2.2 ESTIMATION THEORY 


In what follows, composite estimators of $Y_2$, the current occasion total,
are proposed and their variances determined using an indicator variable
approach.

Let ${}_r a_i = 1$ if the i-th unit is selected at draw r,
r = 1, 2, ..., n+u, and ${}_r a_i = 0$ otherwise. Since the expectation of
${}_r a_i$ is $p_i$, an unbiased estimator of the first occasion population
total $Y_1$ is

$$\hat{Y}_1 = \frac{1}{n} \sum_{r=1}^{n} \sum_{i=1}^{N} {}_r a_i \, y_{1i}/p_i .$$

Then

$$\hat{Y}'_2 = \hat{Y}_1 + \frac{1}{m} \sum_{r=u+1}^{n} \sum_{i=1}^{N} {}_r a_i \, (y_{2i} - y_{1i})/p_i$$

is an unbiased estimator of the second occasion total $Y_2$. An unbiased
estimator of $Y_2$ based on the current observations is

$$\hat{Y}_2 = \frac{1}{n} \sum_{r=u+1}^{n+u} \sum_{i=1}^{N} {}_r a_i \, y_{2i}/p_i .$$

A composite estimator of $Y_2$ is the weighted sum

$$\hat{Y}_{2c} = Q \hat{Y}'_2 + (1-Q) \hat{Y}_2 ,$$

where $0 \le Q \le 1$.

The variance of $\hat{Y}_{2c}$,

$$Var(\hat{Y}_{2c}) = Q^2 Var(\hat{Y}'_2) + (1-Q)^2 Var(\hat{Y}_2) + 2Q(1-Q)\, Cov(\hat{Y}'_2, \hat{Y}_2),$$

is derived by using the following properties of the indicator variable
${}_r a_i$:

$$Var({}_r a_i) = p_i (1 - p_i), \quad (i = 1, 2, \ldots, N; \; r = 1, 2, \ldots, n+u),$$

$$Cov({}_r a_i, {}_t a_i) = -p_i^2, \quad (r \ne t),$$

$$Cov({}_r a_i, {}_r a_j) = -p_i p_j, \quad (i \ne j),$$

$$Cov({}_r a_i, {}_t a_j) = E({}_r a_i \cdot {}_t a_j) - p_i p_j, \quad \text{otherwise},$$

where E(.) denotes the expected value with respect to the probability design.

Now $E({}_r a_i \cdot {}_t a_j) = P({}_r a_i \cdot {}_t a_j = 1) = P({}_r a_i = 1, {}_t a_j = 1)$,
where P(.) denotes probability.


Let $\sum_{(r,k)}^{(i,j)}$ denote summation over all possible ordered
(k-2)-tuples of different units
$j_1, j_2, \ldots, j_{r-1}, j_{r+1}, \ldots, j_{k-1}$ included in the sample
from the first k draws selected from the N-2 units in the set
{1, 2, ..., i-1, i+1, ..., j-1, j+1, ..., N} such that the i-th unit is
selected at draw r and the j-th unit at draw k. There are
(N-2)(N-3)...(N-k+1) terms involved in the summation.

As in Fellegi (1963), let $\{p_i(\ell); \; i = 1, 2, \ldots, N\}$ be the set
of "working probabilities" for selecting a unit at draw $\ell$,
$\ell = 1, 2, \ldots, n+u$. For draws k and r with k > r,

$$E({}_r a_i \cdot {}_k a_j) = \sum_{(r,k)}^{(i,j)} p_{j_1}(1) \, \frac{p_{j_2}(2)}{1 - p_{j_1}(2)} \cdots \frac{p_i(r)}{1 - \sum_{q=1}^{r-1} p_{j_q}(r)} \cdot \frac{p_{j_{r+1}}(r+1)}{1 - \sum_{q=1}^{r} p_{j_q}(r+1)} \cdots \frac{p_j(k)}{1 - \sum_{q=1}^{k-1} p_{j_q}(k)},$$

where $j_r = i$ in the sums appearing in the denominators.
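Now

$$Var(\hat{Y}'_2) = Var\left\{ \frac{1}{n} \sum_{r=1}^{n} \sum_{i=1}^{N} {}_r a_i \, y_{1i}/p_i \right\} + Var\left\{ \frac{1}{m} \sum_{r=u+1}^{n} \sum_{i=1}^{N} {}_r a_i \, (y_{2i} - y_{1i})/p_i \right\}$$

$$\quad + \; 2\, Cov\left\{ \frac{1}{n} \sum_{r=1}^{n} \sum_{i=1}^{N} {}_r a_i \, y_{1i}/p_i , \; \frac{1}{m} \sum_{r=u+1}^{n} \sum_{i=1}^{N} {}_r a_i \, (y_{2i} - y_{1i})/p_i \right\}.$$

Using the previously cited properties of the indicator variables ${}_r a_i$
it may be verified that

$$Var\left\{ \frac{1}{n} \sum_{r=1}^{n} \sum_{i=1}^{N} {}_r a_i \, y_{1i}/p_i \right\} = \frac{1}{n} \sum_{i=1}^{N} p_i z_{1i}^2 + \frac{1}{n^2} \sum_{i \ne j} P(i,j \in s) \, z_{1i} z_{1j} - Y_1^2 ,$$

$$Var\left\{ \frac{1}{m} \sum_{r=u+1}^{n} \sum_{i=1}^{N} {}_r a_i \, (y_{2i} - y_{1i})/p_i \right\} = \frac{1}{m} \sum_{i=1}^{N} p_i (z_{2i} - z_{1i})^2 + \frac{1}{m^2} \sum_{i \ne j} P(i,j \in s_1) (z_{2i} - z_{1i})(z_{2j} - z_{1j}) - (Y_2 - Y_1)^2 ,$$

where $z_{ti} = y_{ti}/p_i$, t = 1, 2 and n-u = m, and

$$Cov\left\{ \frac{1}{n} \sum_{r=1}^{n} \sum_{i=1}^{N} {}_r a_i \, y_{1i}/p_i , \; \frac{1}{m} \sum_{r=u+1}^{n} \sum_{i=1}^{N} {}_r a_i \, (y_{2i} - y_{1i})/p_i \right\}$$

$$\quad = \frac{1}{n} \sum_{i=1}^{N} p_i z_{1i} (z_{2i} - z_{1i}) + \frac{1}{mn} \sum_{i \ne j} P(i \in s, j \in s_1) \, z_{1i} (z_{2j} - z_{1j}) - Y_1 (Y_2 - Y_1).$$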

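Combining the foregoing 3 terms gives

$$Var(\hat{Y}'_2) = \sum_{i=1}^{N} p_i \left[ \frac{z_{2i}^2}{n} + (z_{2i} - z_{1i})^2 \left( \frac{1}{m} - \frac{1}{n} \right) \right] + \sum_{i \ne j} \left[ \frac{P(i,j \in s)}{n^2} z_{1i} z_{1j} + \frac{P(i,j \in s_1)}{m^2} (z_{2i} - z_{1i})(z_{2j} - z_{1j}) + \frac{2 P(i \in s, j \in s_1)}{mn} z_{1i} (z_{2j} - z_{1j}) \right] - Y_2^2 . \quad (1)$$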
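Also,

$$Var(\hat{Y}_2) = \frac{1}{n} \sum_{i=1}^{N} p_i z_{2i}^2 + \frac{1}{n^2} \sum_{i \ne j} P(i,j \in s^*) \, z_{2i} z_{2j} - Y_2^2 , \quad (2)$$

where s* is the set of n units observed on the second occasion, and

$$Cov(\hat{Y}'_2, \hat{Y}_2) = \sum_{i \ne j} \left[ \frac{P(i \in s, j \in s^*)}{n^2} z_{1i} z_{2j} + \frac{P(i \in s_1, j \in s^*)}{nm} (z_{2i} - z_{1i}) z_{2j} \right] + \frac{1}{n} \sum_{i=1}^{N} p_i z_{2i} \left( z_{2i} - \frac{u}{n} z_{1i} \right) - Y_2^2 . \quad (3)$$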


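Expressions (1), (2) and (3), when combined, yield $Var(\hat{Y}_{2c})$.
The optimum value of the weight Q which minimizes $Var(\hat{Y}_{2c})$ is

$$Q_{opt} = [Var(\hat{Y}_2) - Cov(\hat{Y}'_2, \hat{Y}_2)] \, / \, [Var(\hat{Y}'_2) + Var(\hat{Y}_2) - 2\, Cov(\hat{Y}'_2, \hat{Y}_2)].$$

The corresponding minimum variance is

$$Var(\hat{Y}_{2c}) = [Var(\hat{Y}'_2) \, Var(\hat{Y}_2) - (Cov(\hat{Y}'_2, \hat{Y}_2))^2] \, / \, [Var(\hat{Y}'_2) + Var(\hat{Y}_2) - 2\, Cov(\hat{Y}'_2, \hat{Y}_2)].$$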
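
The optimum weight and the minimum variance follow from minimizing a quadratic
in Q, and the two expressions can be cross-checked numerically. The sketch
below does this for arbitrary inputs; the function names and the three input
values are hypothetical and are not results from the paper.

```python
def composite_variance(q, var_prime, var_current, cov):
    """Variance of Q*Y2' + (1-Q)*Y2 for a given weight Q."""
    return (q * q * var_prime + (1.0 - q) ** 2 * var_current
            + 2.0 * q * (1.0 - q) * cov)


def optimum_weight(var_prime, var_current, cov):
    """Return (Q_opt, minimum variance) for the composite estimator."""
    denom = var_prime + var_current - 2.0 * cov   # Var(Y2' - Y2)
    if denom <= 0.0:
        raise ValueError("the two estimators must not be identical")
    q_opt = (var_current - cov) / denom
    v_min = (var_prime * var_current - cov * cov) / denom
    return q_opt, v_min


# Hypothetical variances and covariance.
q_opt, v_min = optimum_weight(var_prime=4.0, var_current=3.0, cov=1.0)
check = composite_variance(q_opt, 4.0, 3.0, 1.0)
```

Substituting the optimum weight back into the general variance formula
reproduces the closed-form minimum, which is a convenient sanity check when
the three inputs come from lengthy joint-probability calculations.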


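An alternative composite estimator $\hat{Y}^*_{2c}$ of $Y_2$ is

$$\hat{Y}^*_{2c} = Q^* \hat{Y}'_2 + (1 - Q^*) \hat{Y}_{2u},$$

where

$$\hat{Y}_{2u} = \sum_{r=n+1}^{n+u} \sum_{i=1}^{N} {}_r a_i \, y_{2i}/(u p_i).$$

The variance of $\hat{Y}^*_{2c}$ is found by combining (1) with

$$Var(\hat{Y}_{2u}) = \frac{1}{u} \sum_{i=1}^{N} p_i z_{2i}^2 + \frac{1}{u^2} \sum_{i \ne j} P(i,j \in s_2) \, z_{2i} z_{2j} - Y_2^2$$

and

$$Cov(\hat{Y}'_2, \hat{Y}_{2u}) = \frac{1}{nu} \sum_{i \ne j} P(i \in s, j \in s_2) \, z_{1i} z_{2j} + \frac{1}{mu} \sum_{i \ne j} P(i \in s_1, j \in s_2) (z_{2i} - z_{1i}) z_{2j} - Y_2^2 .$$


2.3 SPECIAL CASE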


As a check on the calculations, consider the case of simple random sampling
without replacement.

Then $\hat{Y}'_2 = N[\bar{y}_1 + (\bar{y}_{2m} - \bar{y}_{1m})]$, where
$\bar{y}_1$, $\bar{y}_{1m}$ are, respectively, the sample means based on all
the sampled units and all matched units on the first occasion, and
$\bar{y}_{2m}$ is the sample mean based on all matched units on the second
occasion.

A direct evaluation gives

$$Var(\hat{Y}'_2) = N^2 \left[ \left( \frac{1}{m} - \frac{1}{N} \right) S_2^2 + \left( \frac{1}{m} - \frac{1}{n} \right) (S_1^2 - 2 S_{12}) \right],$$

where, e.g.,

$$S_{12} = \frac{1}{N-1} \sum_{i=1}^{N} (y_{1i} - \bar{Y}_1)(y_{2i} - \bar{Y}_2), \quad \bar{Y}_t = Y_t / N.$$

This agrees with the result given by (1) with $p_i = 1/N$ and
$P(i,j \in s) = n(n-1)/N(N-1)$ $(i \ne j)$.

Also, under simple random sampling, $\hat{Y}_2 = N \bar{y}_2$ (where
$\bar{y}_2$ is the sample mean based on all n sampled units on the second
occasion) with variance

$$Var(\hat{Y}_2) = N(N-n) S_2^2 / n .$$

An evaluation of $Var(\hat{Y}_2)$ from (2) gives the same result. Finally,
$Cov(\hat{Y}'_2, \hat{Y}_2)$ may be obtained from either a direct evaluation
or from (3). Similarly, $Var(\hat{Y}^*_{2c})$ may also be checked.

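The simple random sampling check can also be carried out by brute force on a
tiny population. The sketch below enumerates every equally likely ordered
sample of n draws, computes the matched-sample estimator
N[ybar_1 + (ybar_2m - ybar_1m)] for each, and compares the resulting variance
with the closed form
N^2[(1/m - 1/N) S_2^2 + (1/m - 1/n)(S_1^2 - 2 S_12)]. The population values
and function names are hypothetical; the matched units are taken to be draws
u+1 to n, as in Section 2.1.

```python
from itertools import permutations


def enumerate_srs(y1, y2, n, u):
    """Exact mean and variance of the estimator over all ordered SRS draws."""
    N, m = len(y1), n - u
    estimates = []
    for draws in permutations(range(N), n):     # all orderings equally likely
        matched = draws[u:]                     # units kept for occasion 2
        est = N * (sum(y1[i] for i in draws) / n
                   + sum(y2[i] - y1[i] for i in matched) / m)
        estimates.append(est)
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, var


def closed_form(y1, y2, n, u):
    """Closed-form variance under simple random sampling (N-1 divisors)."""
    N, m = len(y1), n - u
    m1, m2 = sum(y1) / N, sum(y2) / N
    s11 = sum((a - m1) ** 2 for a in y1) / (N - 1)
    s22 = sum((b - m2) ** 2 for b in y2) / (N - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (N - 1)
    return N ** 2 * ((1 / m - 1 / N) * s22 + (1 / m - 1 / n) * (s11 - 2 * s12))


y1 = [3.0, 7.0, 4.0, 9.0, 6.0]                  # hypothetical occasion-1 values
y2 = [4.0, 6.0, 5.0, 11.0, 8.0]                 # hypothetical occasion-2 values
mean, var = enumerate_srs(y1, y2, n=3, u=1)
formula = closed_form(y1, y2, n=3, u=1)
```

The enumerated mean equals the second-occasion total, confirming
unbiasedness, and the enumerated variance matches the closed form.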
3. NUMERICAL EXAMPLES 


The composite estimators $\hat{Y}_{2c}$ and $\hat{Y}^*_{2c}$ with their
optimum Q and Q* values which minimize their respective variances are compared
in efficiency with the pps estimator $\hat{Y}_2$ which is based on the current
occasion information only. Because closed forms for $Var(\hat{Y}_{2c})$ and
$Var(\hat{Y}^*_{2c})$ are not available to permit analytic comparisons to be
made, small populations of variate values were employed to effect these
contrasts. (The populations studied were necessarily small, like those one
might encounter in stratified sampling, since the differential effect of
sampling with and without replacement is evident only when the sampling
fractions are not negligible.) Four rotation sampling plans were applied to
each population: (n,m) = (2,1), (3,2), (3,1) and (4,3) where n is the number
of units in the sample on each occasion and m is the number of units in the
sample common between the two occasions. Two of the populations are given in
Murthy (1967) where his single population of 34 villages was subdivided into
two populations of sizes 16 and 17 (one outlier unit being discarded). The
size measure characteristic is x = cultivated acreage in 1961 with $y_1$ and
$y_2$ being the acreage under wheat in 1963 and 1964 respectively. A third
population is a set of 14 farms in the province of Saskatchewan with x = 1980
farm acreage and $y_1$ and $y_2$ the 1980 and 1981 cropland acreages
respectively. Two additional real data sets relating to populations of sizes
15 and 16 respectively are also analyzed.


Table 1 reports the relative efficiencies of $\hat{Y}_{2c}$ and
$\hat{Y}^*_{2c}$ with respect to $\hat{Y}_2$ for each of these 5 populations
and 4 sampling plans. A crucial parameter in each comparison is the
correlation $\rho_z$ between $z_{1i} = y_{1i}/p_i$ and $z_{2i} = y_{2i}/p_i$,
i.e.,

$$\rho_z = \frac{\sum_{i=1}^{N} p_i z_{1i} z_{2i} - Y_1 Y_2}{\left[ \left( \sum_{i=1}^{N} p_i z_{1i}^2 - Y_1^2 \right) \left( \sum_{i=1}^{N} p_i z_{2i}^2 - Y_2^2 \right) \right]^{1/2}} .$$

The populations studied yielded $\rho_z$ values ranging from 0.940 to 0.213.
The optimum Q and Q* values are also cited. An investigation of the
efficiencies of $\hat{Y}_{2c}$ and $\hat{Y}^*_{2c}$ under non-optimum choice
of Q and Q* is planned.
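
For readers who wish to compute the correlation between the two sets of z
values for their own population, a minimal sketch follows. It uses the
probability-weighted form, with z_ti = y_ti/p_i and the p_i summing to one;
the function name and the small population are hypothetical.

```python
import math


def rho_z(y1, y2, p):
    """Probability-weighted correlation between z1 = y1/p and z2 = y2/p."""
    z1 = [a / pi for a, pi in zip(y1, p)]
    z2 = [b / pi for b, pi in zip(y2, p)]
    total1, total2 = sum(y1), sum(y2)
    cov = sum(pi * a * b for pi, a, b in zip(p, z1, z2)) - total1 * total2
    var1 = sum(pi * a * a for pi, a in zip(p, z1)) - total1 ** 2
    var2 = sum(pi * b * b for pi, b in zip(p, z2)) - total2 ** 2
    return cov / math.sqrt(var1 * var2)


p = [0.1, 0.2, 0.3, 0.4]              # hypothetical selection probabilities
y1 = [3.0, 5.0, 2.0, 10.0]            # hypothetical occasion-1 values
y2 = [2.0 * v for v in y1]            # occasion 2 exactly proportional
r = rho_z(y1, y2, p)
```

When the second-occasion values are an exact multiple of the first-occasion
values, as in this example, the correlation is 1. The measure is undefined
when either set of y values is exactly proportional to the p_i, since the
corresponding variance is then zero.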


We note the following from these empirical studies: (1) The optimum Q values
tend to be larger when $\rho_z$ is large and, as $\rho_z$ decreases, the
optimum Q tends to decrease in both $\hat{Y}_{2c}$ and $\hat{Y}^*_{2c}$.
(2) The optimum Q value for $\hat{Y}^*_{2c}$ always exceeds that for
$\hat{Y}_{2c}$. (3) As $\rho_z$ decreases, the efficiency of $\hat{Y}_{2c}$
with respect to $\hat{Y}_2$ decreases (as expected), approaching unity as a
lower bound under an optimum choice of Q. On the other hand, no such distinct
behaviour for $\hat{Y}^*_{2c}$ is evident since $Var(\hat{Y}^*_{2c})$ is not a
monotone function of $\rho_z$. For small $\rho_z$ values, small efficiency
gains and losses relative to $\hat{Y}_2$ are both recorded.
(4) $\hat{Y}^*_{2c}$ is more efficient than $\hat{Y}_{2c}$ for larger $\rho_z$
values whereas $\hat{Y}_{2c}$ is more efficient than $\hat{Y}^*_{2c}$ for
smaller $\rho_z$ values. (5) If $\lambda = m/n$ is small, e.g., the (3,1)
plan, then large efficiency gains using $\hat{Y}^*_{2c}$ in preference to
$\hat{Y}_{2c}$ result for large $\rho_z$'s. For smaller $\rho_z$ values, the
(4,3) plan yields the largest gains using $\hat{Y}_{2c}$; the other three
schemes give about the same gains.

It is worth remarking that even if one has a good correlation between the
$y_{1i}$ and $y_{2i}$ values, composite estimation using $\hat{Y}^*_{2c}$ can
still lead to efficiency losses compared to the use of the pps estimator
$\hat{Y}_2$ based on current occasion data only. The critical factor is the
correlation $\rho_z$ between the $z_{1i}$ and the $z_{2i}$ values. One cannot
lose using $\hat{Y}_{2c}$ under an optimum choice of Q, for
$\hat{Y}_{2c} = \hat{Y}_2$ with Q = 0.


Table 2 provides the efficiencies of Ŷ_2c and Ŷ*_2c in the ppswor design
relative to the estimator Ŷ_2R used by Raj (1965) in his ppswr design
described earlier. For more valid comparisons, it was not assumed that
V_pps(y_t) was the same for occasions t = 1 and 2; the optimum Q* values
for the given (n,m) combinations were utilized. In all cases, as expected,
the estimators Ŷ_2c and Ŷ*_2c in the wor design are more efficient than
Ŷ_2R in Raj's design. As n increases for a given ρ_z, the efficiency
gain using the wor strategy increases. Finally, we note that Raj realized
efficiency gains compared with no matching only when ρ_z ≥ 0.5, whereas
efficiency gains always resulted using Ŷ_2c for any ρ_z in the wor
situation.
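
As a reading aid for Tables 1 and 2, the following minimal sketch computes a
relative efficiency as a ratio of variances, in the form Var(Ŷ_2R)/Var(Ŷ_2c)
used in the column headings of Table 2. The function names and the variance
values are hypothetical and serve only as an illustration; they are not taken
from the populations studied.

```python
def relative_efficiency(var_reference: float, var_candidate: float) -> float:
    """Return Var(reference) / Var(candidate).

    A value above 1 means the candidate estimator has the smaller variance,
    i.e. it is more efficient than the reference estimator.
    """
    if var_reference <= 0 or var_candidate <= 0:
        raise ValueError("variances must be positive")
    return var_reference / var_candidate


def percent_gain(efficiency: float) -> float:
    """Express a relative efficiency as a percentage gain over the reference."""
    return 100.0 * (efficiency - 1.0)


# Hypothetical variances (assumed values, for illustration only).
var_reference = 1.248   # e.g. variance of a reference estimator
var_candidate = 1.000   # e.g. variance of a composite estimator

eff = relative_efficiency(var_reference, var_candidate)
print(f"relative efficiency = {eff:.3f}, gain = {percent_gain(eff):.1f}%")
```

Read this way, an entry of 1.000 corresponds to no gain over the reference
estimator, and entries below 1.000 correspond to efficiency losses.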
TABLE 1


Efficiencies of composite wor estimators relative to ppswor estimator


[Table body not recoverable. Surviving labels: populations Murthy set 1,
Murthy set 2, Acreages, Data set 1 and Data set 2; N = 14, 15, 16; sampling
plans (n,m) = (2,1), (3,2), (3,1), (4,3); one surviving entry, 0.867.]
TABLE 2


Efficiencies of composite wor estimators relative to Raj's composite estimator


-------------------------------------------------------------------------------
Population    N     (n,m)   ρ_z     Var(Ŷ_2R)/Var(Ŷ*_2c)   Var(Ŷ_2R)/Var(Ŷ_2c)
-------------------------------------------------------------------------------
Murthy        aby   (2,1)   0.940   1.038                  1.146
set 1               (3,2)           Vez                    1.252
                    (3,1)           12aS                   1.320
                    (4,3)           1.248                  1.375
Murthy        16    (2,1)   0.867   1.001                  1h
set 2               (3,2)           1.106                  1.246
                    (3,1)           An30                   1.287
                    (4,3)           1.309                  1.380
Acreages      14    (2,1)   0.546   1.244                  1.065
                    (3,2)           1.330                  1.151
                    (3,1)           1.351                  4 as GY
                    (4,3)           1.507                  Mea
Data set 1    15    (2,1)   0.392   1.095                  1.089
                    (3,2)           1.197                  1.192
                    (3,1)           1.200                  1.192
                    (4,3)           1.522                  hei,
Data set 2    16    (2,1)   0.213   1.197                  1.068
                    (3,2)           1.293                  1.152
                    (3,1)           1.280                  1.154
                    (4,3)           1.589                  1.254
-------------------------------------------------------------------------------